Telephone Sample Designs for the 
U.S. Black Household Population¹ 


KATHRYN M. INGLIS, ROBERT M. GROVES, and STEVEN G. HEERINGA² 


ABSTRACT 


The two-stage rejection rule telephone sample design described by Waksberg (1978) is modified to im- 
prove the efficiency of telephone surveys of the U.S. Black population. Experimental tests of sample 
design alternatives demonstrate that: a) use of rough stratification based on telephone exchange names 
and states; b) use of large cluster definitions (200 and 400 consecutive numbers) at the first stage; and 
c) rejection rules based on racial status of the household combine to offer improvements in the relative 
precision of a sample, given fixed resources. Cost and error models are examined to simulate design 
alternatives. 


KEY WORDS: RDD samples; Telephone surveys; Rare population samples. 


1. INTRODUCTION 


Surveys of rare populations lacking special frames often entail large per-unit costs relative 
to similar designs for the full population. When the rare population is a small subgroup of 
a readily identifiable population, the sample of that subgroup is often obtained by screening 
the larger population. Household surveys of demographic subgroups such as the U.S. Black 
population typically use such screening to locate eligible sample units; however, extensive 
screening to identify a rare population sample results in high costs per interview. In recent 
years telephone-sampling methods have been proposed as cost-efficient tools for sampling 
and interviewing rare populations. The cost of telephone interviewing is often less than face- 
to-face interviewing (Groves and Kahn 1979), and when screening is required to identify an 
eligible respondent, the cost-efficiency of telephone interviewing becomes even more marked. 
Still, the screening costs of telephone surveys of rare populations can be high in absolute terms. 

This paper presents ways in which the screening method for telephone surveys can be refined 
to reduce costs while achieving desired levels of precision. We examine a variety 
of telephone sample designs for the U.S. Black household population. The telephone survey 
experiments described here were conducted as part of a study of Black political 
attitudes and electoral behavior in the 1984 U.S. presidential election. 

The use of telephone sampling and interviewing implies that Blacks living in households 
without telephones (about 15 percent of the U.S. Black household population) are not covered 
by the survey procedures. Such persons tend to be poorer and younger than those living in 
households with telephones (Thornberry and Massey 1983). To the extent that Blacks without 
telephones have attitudes and voting behaviors that are different from those with telephones, 
the survey estimates would differ from Black household population parameters. While not 
wanting to discount noncoverage error associated with telephone surveys of the Black popula- 
tion, this paper focuses on differential cost efficiencies and sampling error that might result 
from alternative approaches to telephone samples of Black households. 


¹ Revision of paper presented at the 1985 American Statistical Association meetings. Research was partially 
supported by the U.S. Bureau of the Census and the Survey Research Center. The discussion does not necessarily 
represent the views of those organizations. 


² Kathryn M. Inglis, McNair Anderson and Associates, Australia. Robert M. Groves and Steven G. Heeringa, Survey 
Research Center, University of Michigan, Ann Arbor, Michigan, 48106-1248, United States. 




The telephone sample designs presented here are extensions of a design described by 
Waksberg (1978). That random digit dialing (RDD) design (commonly referred to as the 
Waksberg-Mitofsky design) is a two-stage cluster sample of telephone numbers. U.S. telephone 
numbers contain 10 digits: a three-digit area code, a three-digit central office code or ‘‘prefix’’, 
and a four-digit suffix in the range 0000-9999 (e.g., 313-764-4424). At the primary stage, 
a stratified sample of 10-digit telephone numbers is randomly generated, and each such 
‘‘primary number’’ is linked to a block of 100 consecutive numbers (e.g., 313-764-4424 would 
be linked to the ‘‘100-series’’ 313-764-4400 to 313-764-4499). For household surveys, if the 
primary number is found to be a working household number, then its cluster of 100 consecutive 
telephone numbers is retained at the first stage for further sampling. If not, its 
‘‘100-series’’ is discarded. Therefore, the probability of selection of a first stage 100-series 
is proportional to the number of working household numbers in that 100-series. In the se- 
cond stage of sampling, equal numbers of working household numbers are selected from 
each of the 100-series retained at the primary stage. Therefore, the second stage sampling 
of households is performed with conditional probabilities of selection inversely proportional 
to the number of working household numbers in the 100-series. Thus, the design yields an 
equal probability (epsem) sample of household numbers, and clusters them so that the pro- 
portion of total numbers selected which reach households is higher than that obtained by 
a stratified random RDD sample. To clarify the discussion here, we refer to the 100-series 
banks of consecutive numbers as the primary stage unit (PSU) of the two-stage RDD design. 
The term ‘‘cluster’’ is reserved for the fixed set of working household numbers that is selected 
from the PSUs at the design’s second stage. 
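
The equal probability property described above can be checked with a minimal sketch. The function names and the example counts are illustrative assumptions, not part of the original design documentation, and k is taken here to be the fixed number of working household numbers selected per retained 100-series, counting the primary number.

```python
from fractions import Fraction


def hundred_series(number: str) -> tuple[str, str]:
    """Return the first and last numbers of the 100-series containing `number`."""
    digits = number.replace("-", "")
    if len(digits) != 10 or not digits.isdigit():
        raise ValueError("expected a 10-digit telephone number")
    # Area code, prefix, and the first two digits of the suffix define the block.
    stem = f"{digits[:3]}-{digits[3:6]}-{digits[6:8]}"
    return stem + "00", stem + "99"


def household_selection_probability(working_households: int, k: int) -> Fraction:
    """Per-draw probability that a given household number ends up in the sample.

    First stage: the 100-series is retained when the primary number reaches a
    working household, so retention is proportional to working_households / 100.
    Second stage: k of the working household numbers in the series are taken.
    """
    if not 1 <= k <= working_households <= 100:
        raise ValueError("need 1 <= k <= working_households <= 100")
    first_stage = Fraction(working_households, 100)
    second_stage = Fraction(k, working_households)
    return first_stage * second_stage


# Two hypothetical 100-series with very different densities of household numbers
# give every household number the same overall chance of selection.
sparse = household_selection_probability(working_households=20, k=5)
dense = household_selection_probability(working_households=60, k=5)
```

The product of the two stage probabilities reduces to k/100 whatever the density of the series, which is the sense in which the design is epsem.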

In this research the sample design modifications aimed at reducing screening costs take 
three forms: a) stratification of telephone exchange units by proportion Black, and dispropor- 
tionate allocation of the sample to high density Black strata; b) use of two-stage rejection 
rules based on both residential status and race of the household; and c) increase in PSU size 
(from 100 consecutive numbers to 200 and 400). 

Stratification of the telephone population by race attempts to isolate exchange areas with 
high proportions of telephone subscribers who are Black. Higher sampling fractions are then 
applied to those strata, relative to strata with lower proportions Black. Under this dispropor- 
tionate sample design, the total number of households that have to be contacted in order 
to obtain one interview with an eligible Black household is smaller than that for an epsem 
sample of the household population. Consequently, the screening costs for locating a sam- 
ple of Black households are reduced. In telephone samples, the basic geographical unit for 
stratification is the wire center or telephone exchange, to which one or more three-digit prefixes 
(central office codes) may be assigned. In general, no counts of the subscriber population 
by racial characteristics are available for these sampling units. Thus, proxy indicators of high 
density Black exchanges must be used. The experiments described in this paper examined 
the value of such proxy indicators. 

Blair and Czaja (1982) present an alteration of the Waksberg-Mitofsky RDD design which 
incorporates two-stage rejection rules based on both residential status and race eligibility 
of the household. For the Black population this method includes, at the first stage, only 
100-series whose primary number was assigned to a Black household and then samples a 
fixed total of Black household numbers within those PSUs. In a U.S. national sample survey, 
Blair and Czaja found that using this design, the percentage of Black households among 
all household numbers chosen increased from 9 percent for the first stage to 25 percent for 
the second stage numbers. Given the compensating probabilities of selection in the two stages, 
this epsem design greatly reduces the level of screening required to obtain any given sample 
size of Black households. A similar alteration of the rejection rules for the two-stage Waksberg- 
Mitofsky design was employed in the experiments described in this paper. 
In the Blair and Czaja design some of the primary stage 100-series contained too few 
Black household numbers to yield the number of elements per cluster required (10 in their 
case) for an epsem sample of Black households. In addition, relatively large screening costs 
are incurred at the first stage of selection for this design; over 44 primary numbers must 
be dialed to locate one Black household. The joint solution to these two problems is both to 
increase the size of the PSU and to select larger numbers of second stage elements per PSU. 
The analyses reported here examined the use of primary stage units of 100, 200, and 400 
consecutive numbers each. The extension of the PSU definition beyond the standard 100 
consecutive numbers was suggested by observations on the assignment of telephone numbers 
within prefixes. The following appears to be the most common pattern: 1) almost all household 
numbers within a prefix serve units located within the geographical boundaries of the ex- 
change; 2) there is little geographical clustering of assignments within exchanges (i.e., neighbors 
do not tend to have consecutive telephone numbers, nor need they have numbers in the same 
prefix); and 3) there is more diversity in the percentage of household numbers among 
1000-series than among 100-series within the same 1000-series of numbers. These impres- 
sions are the result of several years of household telephone sampling at the Survey Research 
Center. Observations 1) to 3) suggest that the expansion of the PSU definition from 100 
consecutive numbers to a larger number might permit the use of larger clusters of secondary 
numbers with little reduction in the proportion of those numbers which are Black households. 


2. THE PILOT STUDY 


In two integrated experiments imbedded in a pilot survey, several design alternatives were 
tested. One purpose of the pilot study was to examine the ability of stratification based on 
civil government units, with only rough correspondence to telephone exchanges, to isolate 
sets of telephone numbers densely filled with Black household numbers. For this, three strata 
of exchanges were defined: 


1. ‘‘High density’’- Exchanges corresponding to the central cities of large Standard 
Metropolitan Statistical Areas (e.g., Chicago city, for the Chicago SMSA). This iden- 
tification was based on the name of the telephone exchanges in these areas. 

2. ‘‘Medium density’’- All other exchanges in selected southern states (Virginia, North 
Carolina, South Carolina, Florida, Georgia, Alabama, Mississippi, Louisiana). The 
vast majority of exchanges lie in only one state; those serving two states were associated 
with the state given in the exchange name. 

3. ‘‘Low density’’- The balance of exchanges in the coterminous United States. 


An equal probability sample of 1400 six-digit area code/central office code prefix combina- 
tions was then systematically selected from the 34,389 such combinations listed as active on 
a frame which can be purchased from American Telephone & Telegraph (AT&T). Four-digit 
random numbers were appended to each selected six-digit stem to yield a sample of 1400 
ten-digit primary numbers. 
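
A minimal sketch of this selection step follows, assuming a fractional-interval systematic selection with a random start; the paper does not state the exact procedure. The frame of six-digit stems built below is a hypothetical stand-in for the purchased frame, and the function names are illustrative.

```python
import random


def systematic_sample(frame_size: int, n: int, rng: random.Random) -> list[int]:
    """Return n zero-based frame positions chosen with a fractional interval."""
    if not 0 < n <= frame_size:
        raise ValueError("need 0 < n <= frame_size")
    interval = frame_size / n          # about 24.56 for 34,389 and 1400
    start = rng.random() * interval    # random start within the first interval
    return [int(start + i * interval) for i in range(n)]


def primary_numbers(stems: list[str], rng: random.Random) -> list[str]:
    """Append a random four-digit suffix (0000-9999) to each six-digit stem."""
    return [stem + f"{rng.randrange(10000):04d}" for stem in stems]


rng = random.Random(1987)  # fixed seed so the sketch is repeatable
# Hypothetical frame: 34,389 distinct six-digit area code/prefix combinations.
frame = [f"{i:06d}" for i in range(100000, 134389)]
positions = systematic_sample(len(frame), 1400, rng)
sample = primary_numbers([frame[p] for p in positions], rng)
```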

The results of the pilot study demonstrated that the three strata had vastly different pro- 
portions of Black telephone numbers. The low density stratum was found to require over 
six times as much screening to locate a Black household as was required in the high density 
stratum. (This result was confirmed with more precision in the production study, discussed 
in the next section). 

Another purpose of the pilot study was to test the use of rejection rules based on racial 
composition and working household status of sample numbers from PSUs of differing size. 
To provide increased precision in analyses related to this objective, an additional 500 primary 
numbers were selected from the high- and medium-density strata. The 1900 primary numbers 
in the combined pilot study sample were then dialed and screened for their Black household 
status. If the sampled primary number reached a Black household, it simultaneously iden- 
tified three different PSUs. As shown in Table 1, every individual number can be viewed 
as belonging to a single 100-series, a single 200-series, and a single 400-series. For example, 
the number 313-764-4424 is a member of the 4400-4499 100-series, the 4400-4599 200-series, 
and the 4400-4799 400-series. To test the feasibility of expanding the PSU size, the pilot study 
sampled secondary numbers from each of these three series. The second stage cluster 
sizes of Black households were set at 3 for the 100-series of the primary number, 6 for the 
200-series, and 9 for the 400-series clusters. In both the primary and secondary stages of 
selection, if the race of the household was not known, it was assumed to be a non-Black 
household. 
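
The nesting of the three PSU definitions can be sketched as follows. The code assumes that each series starts at a multiple of its size within the four-digit suffix range; this alignment is an assumption, chosen because it reproduces the 313-764-4424 example given in the text.

```python
def series_bounds(suffix: int, size: int) -> tuple[int, int]:
    """Return the first and last suffix of the `size`-number series holding `suffix`."""
    if not 0 <= suffix <= 9999:
        raise ValueError("suffix must lie in 0000-9999")
    if size not in (100, 200, 400):
        raise ValueError("size must be 100, 200, or 400")
    start = (suffix // size) * size  # assumed alignment at multiples of `size`
    return start, start + size - 1


# The suffix 4424 from the example number 313-764-4424.
bounds = {size: series_bounds(4424, size) for size in (100, 200, 400)}
```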

Table 1 presents the disposition of the secondary numbers by PSU type and stratum. Of 
most interest is the proportion of secondary numbers assigned to Black households for the 
different PSU definitions. For the 100-series, .134 of all secondary numbers are Black 
household numbers. This implies that .223 of the households sampled were Black, compared 
to the .25 Black households found by Blair and Czaja. For the 200-series PSUs, .124 of all 
secondary numbers are Black household numbers. For the 400-series, .115 of all second stage 
sample telephone numbers are assigned to Black households. These proportions are all within 
sampling error of each other (the standard error of each estimate is at least .02). That is, 
no significant decrease in the proportion eligible was observed when the PSU definition was 
expanded from 100 to 400 consecutive numbers. These rates imply that while 100-series PSUs 
on the average can support second stage clusters of 13 or 14 sample Black households, the 
400-series might on the average support cluster sizes of 46 sample Black households. The 
ability to increase the Black household cluster size at the second stage of sampling enables 
the researcher to greatly reduce sample screening costs. 
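
The arithmetic behind these figures can be reproduced in a few lines. The function names are illustrative; the rates are the totals reported for the pilot study, and the household share is computed on the assumption that all working residential numbers, including those of unknown race, form the denominator.

```python
def household_share(black_rate: float, nonresidential_rate: float) -> float:
    """Proportion of sampled households that are Black households."""
    return black_rate / (1.0 - nonresidential_rate)


def expected_black_households(black_rate: float, psu_size: int) -> float:
    """Average number of Black household numbers a PSU of this size can hold."""
    return black_rate * psu_size


share_100 = household_share(0.134, 0.400)               # about .223
capacity_100 = expected_black_households(0.134, 100)    # about 13.4
capacity_400 = expected_black_households(0.115, 400)    # about 46
```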

Table 1 also compares the proportion of eligible secondary numbers for PSUs sampled 
from the three different strata used in the pilot study. For all the PSU definitions (100, 200, 
400) the same result applies — the large SMSA telephone exchanges in the high Black densi- 
ty stratum offer close to a doubling of the eligibility rate when compared to the rate for 
the overall population (.21 versus .12 or .13). The medium density stratum, consisting of 
non-SMSA exchanges in selected Southern states, has eligibility rates below that of the na- 
tion as a whole (between .08 and .10). The low density stratum, the remainder of the coun- 
try, also has lower than average eligibility rates (between .07 and .085). Since the high density 
stratum covers about 36 percent of the Black household population with telephones, the chosen 
stratification, in combination with disproportionate allocation of the primary stage samples, 
is an effective tool for reducing screening costs. 


3. THE PRODUCTION STUDY 


The production study used the stratification plan that was developed and tested in the 
pilot study. A disproportionately allocated sample of 11,223 primary numbers was selected 
from the three Black-density strata using sampling fractions in the ratio 3:2:1 (High:Medium: 
Low). Although the pilot study found no significant difference in the working household 
rate for PSUs of 200 and 400 consecutive numbers, a conservative decision was made to 
use the smaller 200-series PSUs in the production study. The expected second stage cluster 
size for each PSU was set at 5.5 Black households (not counting the primary number). Primary 
and secondary stage rejection rules for the modified two-stage Waksberg-Mitofsky design 
were identical to those used for the pilot study. Since much larger sample sizes were used 
in the production study, questions about precision and relative efficiencies of the design can 
be addressed with more confidence. 
Table 1 

Pilot Study 
Disposition of Secondary Numbers Selected within 100-, 200- and 400-Series by Stratum 

                                   Proportion of All Numbers Selected 

Stratum and Disposition            100-Series    200-Series    400-Series* 

High Density Black Stratum 
  Black Households                    .205          .201          .214 
  Don’t Know Race                     .028          .029          .032 
  Non-Black Households                .316          .279          .275 
  Nonresidential/Nonworking           .451          .491          .479 
  Number of Cases                    (395)         (806)        (1163) 

Medium Density Black Stratum 
  Black Households                    .104          .080          .076 
  Don’t Know Race                     .030          .018          .020 
  Non-Black Households                .494          .443          .420 
  Nonresidential/Nonworking           .372          .459          .484 
  Number of Cases                    (231)         (560)         (878) 

Low Density Black Stratum 
  Black Households                    .085          .084          .069 
  Don’t Know Race                     .014          .028          .027 
  Non-Black Households                .532          .577          .607 
  Nonresidential/Nonworking           .369          .311          .297 
  Number of Cases                    (141)         (286)         (491) 

Total 
  Black Households                    .134          .124          .115 
  Don’t Know Race                     .024          .025          .026 
  Non-Black Households                .442          .431          .448 
  Nonresidential/Nonworking           .400          .420          .411 
  Number of Cases                    (767)        (1652)        (2532) 

* Weighted estimate to compensate for the disproportionate allocation of the cluster of 9 secondary 
numbers across the separate 100-number ranges of the 400-series. 


Table 2 presents the results from both the primary and secondary number screening for 
the production study. The unbiased weighted estimate for an ‘‘epsem’’ two-stage RDD design 
suggests that 13 percent of all secondary numbers were Black households (the standard error 
about this estimate is .6 percent). This is in close agreement with the 12 percent secondary 
number eligibility rate observed in the pilot study. A comparison of the results for the primary 
stage of selection with those of the secondary stage illustrates the large gains possible by 
using a two-stage design for telephone sampling of Black households. The gains under the 
two-stage design are most dramatic in the low density Black stratum where there is nearly 
a nine-fold increase in the proportion of Black household numbers from the primary to secon- 
dary stage (.011 to .090). In the high density stratum the increase is closer to a twofold one 
(.072 to .190). For the disproportionate allocation design, the unweighted proportions 
of Black households at the two stages are 3 percent (primary stage) and 15 percent 
(secondary stage). Comparison of these figures with the estimates for the epsem design (i.e., 
2 percent and 13 percent) indicates the reduction in screening achieved by disproportionate 
allocation. 
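
The relation between the unweighted and the weighted (‘‘epsem’’) proportions can be illustrated with a short sketch. It assumes that the weights are simply the inverses of the 3:2:1 sampling fractions and pools the stratum rates and case counts from Table 2; the paper does not give its estimator in detail, so the results only approximate the published figures.

```python
def pooled_rate(rates, counts, weights=None):
    """Pool stratum proportions, optionally weighting each case by `weights`."""
    if weights is None:
        weights = [1.0] * len(rates)
    numerator = sum(w * n * r for r, n, w in zip(rates, counts, weights))
    denominator = sum(w * n for n, w in zip(counts, weights))
    return numerator / denominator


# Strata in the order High, Medium, Low; Black household rates from Table 2.
inverse_fraction_weights = [1 / 3, 1 / 2, 1.0]   # assumed from the 3:2:1 allocation
primary_rates, primary_counts = [0.072, 0.032, 0.011], [3128, 1879, 6116]
secondary_rates, secondary_counts = [0.190, 0.141, 0.090], [6671, 2375, 3987]

primary_unweighted = pooled_rate(primary_rates, primary_counts)
primary_epsem = pooled_rate(primary_rates, primary_counts, inverse_fraction_weights)
secondary_unweighted = pooled_rate(secondary_rates, secondary_counts)
secondary_epsem = pooled_rate(secondary_rates, secondary_counts, inverse_fraction_weights)
```

Under these assumptions the pooled values come out close to the 3 and 15 percent (unweighted) and 2 and 13 percent (weighted) figures quoted in the text; small differences reflect rounding of the published stratum rates.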

As in the pilot study, the percentage of Black households varies over the three strata, 
although the advantage to distinguishing the medium and low density strata is more evident. 
Across the three strata, the Black household eligibility rate for secondary numbers varies 
in an approximate 2:1.5:1 ratio. The three strata also differ in the total proportion of secondary 
numbers that are assigned to residences. The high density Black stratum has larger proportions 
of secondary numbers assigned to nonresidential units, probably reflecting the 
urbanization levels of the exchanges in that stratum. 


Table 2 

Production Study 
Disposition of Numbers Selected by Stratum 

Stratum and Disposition              Primaries     Secondaries 

High Density Stratum 
  Black Households                      .072           .190 
  Don’t Know Race                       .035           .027 
  Non-Black Households                  .219           .352 
  Nonresidential/Nonworking             .674           .431 
  Number of Cases                     (3,128)        (6,671) 

Medium Density Stratum 
  Black Households                      .032           .141 
  Don’t Know Race                       .020           .018 
  Non-Black Households                  .188           .469 
  Nonresidential/Nonworking             .760           .372 
  Number of Cases                     (1,879)        (2,375) 

Low Density Stratum 
  Black Households                      .011           .090 
  Don’t Know Race                       .019           .023 
  Non-Black Households                  .199           .505 
  Nonresidential/Nonworking             .771           .382 
  Number of Cases                     (6,116)        (3,987) 

Estimate for ‘‘Epsem Design’’* 
  Black Households                      .021           .129 
  Don’t Know Race                       .021           .023 
  Non-Black Households                  .200           .454 
  Nonresidential/Nonworking             .758           .394 

Proportion Black Households 
  for Disproportionate Design           .031           .150 

Number of Cases                      (11,123)       (13,033) 

* Weighted estimates of ‘‘epsem design’’ rates. Weights compensate for disproportionate sampling 
rates used to select the Production Study sample from the three density strata. 
Each PSU of 200 consecutive numbers can be viewed as two half-PSUs of 100 numbers 
each. Table 3 demonstrates that proportions of nonresidential numbers (.378) found in the 
half-PSU (100-series) in which the sample primary number fell are lower than in the other 
half-PSU (.409), but this difference is not statistically significant at the .05 level (standard 
error about .02). Similarly, the proportion of Black households is somewhat larger in the 
100-series of the primary number (.133) than in the adjacent 100-series (.125). Again, this 
difference is not likely to be found in most replications of the experiment. Table 3 provides 
another perspective on the results in Table 2, showing only a negligible reduction in the pro- 
portion eligible in 100-series adjacent to that of the primary numbers. 

The average eligibility rate (the proportion of Black households) across PSUs should not 
be the only criterion for evaluating the sample design. In order to implement an epsem design 
within strata, each PSU in the design must have a sufficient number of Black households 
to support the designated number of second stage sample Black households. Thus, the distribu- 
tion over PSUs of the proportion eligible is also of interest. Figures 1, 2 and 3 contain 
histograms describing the distribution over all the PSUs of the proportion of Black households 
by stratum. The stability of the three distributions varies because the number of sample PSUs 
is about four times greater in the high density stratum than the other two (224 PSUs in the 
high density stratum to about 60 in the medium and low density strata). The shapes of the 
distributions, however, appear to be very different for the three strata. The distributions 
for the low and medium density strata are highly skewed, with 60 percent of PSUs in the 
medium density stratum and 65 percent of PSUs in the low density stratum having 5 to 20 
percent Black households. These eligibility rates correspond to a maximum of 10 to 40 sample 
Black households for the 200-series PSUs from the low and medium density strata. 
In the production study the low density stratum contained several PSUs that would not per- 
mit those cluster sizes (6 of the 63 PSUs in that stratum are estimated to have fewer than 
10 Black households). The distribution in the high density stratum is much more uniform 
(4 of the 224 PSUs estimated to have fewer than 10 Black households). 

These distributions of percentage Black households by PSU deserve more discussion. Given 
our current understanding of the assignment of residential numbers to available banks of 
numbers, there is no reason to believe that within an exchange (or a prefix) there are general 
tendencies to assign different residential areas to different 100-series. That is, within an ex- 
change serving both Black and non-Black households the hypothesis of assignment of numbers 
without regard to the race of the subscriber is a strong one. Stated alternatively, 


Table 3 

Production Study 
Disposition of Secondary Numbers by Whether in 
Same 100-Series as Primary Numbers 

Disposition                      Same 100-Series        Adjacent 
Status                           as Primary Number      100-Series 

Black Households                      .133                 .125 
Don’t Know Race                       .024                 .022 
Non-Black Households                  .465                 .444 
Nonresidential/Nonworking             .378                 .409 

Number of Cases                     (6,522)              (6,511) 
Figure 1. Percentage of High Density Clusters By Proportion of Black Households 

Figure 2. Percentage of Medium Density Clusters By Proportion of Black Households 

Figure 3. Percentage of Low Density Clusters By Proportion of Black Households 

[Figures 1 to 3 are histograms. Vertical axis: Percentage of Clusters. Horizontal axis: Proportion of Black Households.] 

Survey Methodology, June 1987 


unless the exchanges are subdivided into wire centers that correspond to the residential loca- 
tions of Black households, there is no a priori reason for large amounts of clustering of Black 
households within 200-series. Following this logic, the more uniform distribution in the high 
density stratum reflects, we believe, the variability in proportions of Blacks among the 
telephone populations in the different exchanges in the stratum. 


4. SAMPLING VARIANCE PROPERTIES 


To achieve greater cost-efficiency in the RDD sampling of Black households it is advan- 
tageous to use both large clusters of sample households per PSU (i.e., for a fixed sample 
size, a smaller number of PSUs) and disproportionate allocation of PSUs to strata of ex- 
changes which vary in their proportion of Black telephone households. While both greater 
clustering and disproportionate allocation of the sample improve cost-efficiency, the overall 
precision of the sample is affected by the increased clustering effects and added design ef- 
fects due to the non-optimal weighting that is required to compensate for the unequal selec- 
tion probabilities for households from the three density strata. Increased design effects of 
sample estimates due to non-optimal weighting are described in Kish (1976). The clustering 
influence on the design effect for the modified RDD procedures is developed in the follow- 
ing paragraphs. 

Ceteris paribus, the larger the number of sample elements chosen per PSU the higher 
the design effect (the ratio of the sampling variance of the given design to that of a simple 
random sample with the same number of elements). The model often used is 
$\mathrm{Deff} = 1 + \rho(b - 1)$, where Deff is the design effect, $\rho$ is the intracluster correlation for 
the statistic, and b is the number of sample elements per PSU. Others have shown for many 
variables on the total U.S. household population that the intracluster correlations for the 
100-series tend to be smaller than those generally found in area probability sample clusters 
(see Groves, 1978). This may not be the case for the Black population for 100-series, and 
there are no empirical estimates available concerning intracluster correlations for 200-series 
clusters. The expectation prior to estimating sampling errors was that there would be no change 
in the intracluster correlations between the 100- and 200-series. This hypothesis reflects the 
understanding of the assignment of telephone numbers within exchanges that was described 
above. 
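
The design effect model above can be written as a one-line function. This is a minimal sketch: the intracluster correlations and cluster sizes used in the example are hypothetical values chosen for illustration, not estimates from the study.

```python
def design_effect(rho: float, b: float) -> float:
    """Design effect under the model Deff = 1 + rho * (b - 1)."""
    return 1.0 + rho * (b - 1.0)


# With no intracluster correlation, clustering costs nothing in precision.
no_clustering_penalty = design_effect(rho=0.0, b=10)

# For a fixed hypothetical rho, larger clusters inflate the variance more.
small_cluster = design_effect(rho=0.02, b=3.0)
large_cluster = design_effect(rho=0.02, b=6.5)
```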

Based on sampling errors estimated from the production study data set, the average design 
effect for a selected set of seven survey statistics is 1.28 for the 100-series and 1.30 for the 
200-series. The 100-series average design effect was estimated from those cases which fell 
into the 100-series of the primary number, while the cases from the entire 200-series were 
used in computing the average 200-series design effect. Thus, the average cluster size of com- 
pleted interviews is 2.0 for the 100-series (coefficient of variation, .043) and 3.4 for the 
200-series (coefficient of variation, .029). These design effects reflect all the stratification, 
clustering and weighting in the design and also the fact that the variability in the cluster sizes 
in the 100-series is greater. (The rejection rule forced an equal number of sample Black 
households at the 200-series but not necessarily at the 100-series level.) Given that the average 
design effects for the 100-series and the 200-series are close to one another (1.28 to 1.30), 
the dominant influence on the sampling variance appears to be non-optimal weighting re- 
quired by the disproportionately allocated sample design, with little loss in precision due 
to PSU size alone (moving from the 100- to the 200-series clusters). 

Table 4 presents the synthetic intracluster correlations by stratum for the seven 
survey statistics used to compute the estimate of average design effect. The estimates of synthetic 
intracluster correlations were obtained from the design effect, following Kish’s model 
of $\rho = (\mathrm{Deff} - 1)/(b - 1)$, and are unweighted so as to remove the confounding effect 
of weighting on the synthetic estimates. The estimates in the table tend to be unstable due 
to the small number of clusters in each stratum, the small average cluster size of completed 
interviews, and its associated coefficient of variation. These sample design features com- 
plicate our inference about clustering effects in the 100- versus the 200-series. Overall, the 
100-series estimates of intracluster correlation are somewhat higher than those in the 200-series. 
We believe that this reflects more an instability in the estimated synthetic correlation than 
a real difference in clustering effects. We believe that these estimates provide little evidence 
that there is a change in the intracluster correlation between the 100- and 200-series. 
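
The inversion used for the synthetic correlations can be sketched as below. The design effects and cluster sizes in the example are hypothetical; they are not the unweighted stratum values behind Table 4, which the paper does not report.

```python
def synthetic_rho(deff: float, b: float) -> float:
    """Synthetic intracluster correlation, rho = (Deff - 1) / (b - 1)."""
    if b <= 1:
        raise ValueError("average cluster size b must exceed 1")
    return (deff - 1.0) / (b - 1.0)


# A hypothetical design effect slightly above 1 gives a small positive value.
positive_example = synthetic_rho(deff=1.06, b=3.4)

# An estimated design effect below 1 gives a negative synthetic correlation.
negative_example = synthetic_rho(deff=0.90, b=2.0)
```

Because b − 1 is small when the average cluster size is near 2, modest sampling fluctuation in the estimated design effect translates into large swings in the synthetic correlation, which is consistent with the instability noted in the text.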


5. OPTIMAL DESIGN FEATURES 


The previous sections of the paper address the effect of alternative sample features on 
cost-efficiency and sampling variance. Survey costs and errors are often combined at the 
design step to address whether ‘‘optimal’’ features of the survey can be identified. This ap- 
proach attempts to identify the design which offers minimum variance for a fixed set of 
resources allocated to the survey. Given the data in this research we can estimate the optimal 
choices of two design attributes: a) number of sample elements per PSU, and b) allocation 
of the sample across the three ‘‘Black-density’’ strata. 

To determine the optimal cluster size we use a total cost model, $C = C_0 + C_a a + C_b ab$, 
where $C_0$ represents fixed costs, $C_a$ is the sampling and screening cost for each sample 
cluster, of which a are selected, and $C_b$ is the sampling, screening and interviewing cost 




Table 4 

Production Study 
Synthetic Intracluster Correlations 
for 100- and 200-Series Clusters for Seven Statistics by Stratum 

                                        Synthetic Intracluster Correlation* 

                                 High Density       Medium Density     Low Density 
                                 Black Stratum      Black Stratum      Black Stratum 
Statistic                        100-      200-     100-      200-     100-      200- 
                                 Series    Series   Series    Series   Series    Series 

Proportion Very Satisfied 
  with Life as a Whole            .021     -.002    -.172     -.042    -.238     -.116 
Proportion Who Think They 
  Are Better Off Financially 
  Than One Year Ago               .113      .075     .094      .069     .206      .049 
Proportion Who Will Vote 
  for Mondale                     .189      .021     .086     -.087    -.436     -.046 
Proportion Who Attend Church      .013      .017    -.009     -.078     .035     -.110 
Proportion in Same City 
  or Town All of Life            -.078      .001     .058      .114  [illegible]  .248 
Proportion Voted in 1980 
  Presidential Election          -.045     -.035    -.101     -.013     .364      .356 
Proportion Who Think Reagan 
  Will Be Elected President      -.045     -.045    -.545     -.078     .124     -.105 

Average                           .024      .005    -.084     -.016     .039      .039 

* These estimates are unweighted. 


associated with each interview obtained, of which there are $b$ in each cluster. Because the
proportions of Black households vary across the three strata in the design, the per-cluster
cost $C_a$ and the per-interview cost $C_b$ vary across strata (see Table 5). The optimal
cluster size is computed as $b_{opt} = \sqrt{C_a (1 - \rho)/(C_b \rho)}$, where $\rho$ is the
intracluster correlation (Kish 1965). Using cost data from the production survey, Table 5
presents estimated optimal cluster sizes for overall means and proportions with three alternative
levels of intracluster correlation: .005, .01, and .02. (These values are similar to those obtained
for attitudinal and behavioral variables in the actual surveys.) The $C_a$ and $C_b$ cost estimates
for each stratum also appear. The table shows that the optimal cluster sizes are largest in
the low density stratum, reflecting the high screening costs in that group. Note also that these
optimal cluster sizes tend to be larger than those actually used in the survey, b = 6.5.
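
The optimal cluster size calculation can be sketched numerically. The example below is only an illustration of the Kish (1965) formula: the two cost figures are hypothetical placeholders, not the Table 5 estimates, and the function names are ours.

```python
import math


def optimal_cluster_size(cost_per_cluster, cost_per_interview, rho):
    """Optimal number of sample elements per cluster (Kish 1965).

    cost_per_cluster   -- sampling and screening cost per sample cluster
    cost_per_interview -- sampling, screening and interviewing cost per interview
    rho                -- intracluster correlation, strictly between 0 and 1
    """
    if not 0 < rho < 1:
        raise ValueError("rho must lie strictly between 0 and 1")
    return math.sqrt((cost_per_cluster / cost_per_interview) * (1 - rho) / rho)


def design_effect(cluster_size, rho):
    """Approximate design effect for clusters of a given size."""
    return 1 + rho * (cluster_size - 1)


# Hypothetical costs in dollars (assumed for illustration, not from Table 5).
EXAMPLE_COST_PER_CLUSTER = 100.0
EXAMPLE_COST_PER_INTERVIEW = 25.0

for rho in (0.005, 0.01, 0.02):
    b_opt = optimal_cluster_size(
        EXAMPLE_COST_PER_CLUSTER, EXAMPLE_COST_PER_INTERVIEW, rho
    )
    print(f"rho = {rho}: optimal b = {b_opt:.1f}, "
          f"deff = {design_effect(b_opt, rho):.2f}")
```

With these assumed costs the optimal cluster size falls as the intracluster correlation rises, and it grows with the ratio of per-cluster to per-interview cost, which is the pattern described for the low density stratum.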

Note further that the loss of cost-efficiency of the 200-series relative to the 100-series
is minor, so the optimal cluster sizes are similar for 100- and 200-series PSUs. (The sampling
variance estimates also imply that intracluster correlations in the 100- and 200-series
clusters are similar.)

The optimal cluster sizes in Table 5 generally exceed the levels that could be supported 
with a 100-series PSU definition. That is, a large proportion of 100-series PSUs would not 
have a sufficient number of Black household numbers to fulfill the designated second stage 
cluster size. For that reason alone, the 200-series is favored. Even with 200-series, the specified 
second stage cluster sizes could not be obtained for some PSUs in the low density stratum. 
(This suggests that the true optimal cluster size should be constrained to reflect the
capacities of the PSUs. The approach used here is a useful guide to practical decisions on
cost-efficiency, but it does not reflect some extreme conditions.)

Table 5


Cost Parameters and Optimal Number of Sample Elements Per Cluster,
by Stratum for 100- and 200-Series Clusters and Different ρ Values


Optimal Cluster Size Cost Parameters 


Stratum and 
Cluster Definition 


$114.09 


Low Density Stratum 


$309.98 


The second design decision evaluated is the choice of sample allocation to strata. The 
survey used sampling fractions in the ratio of 3:2:1 from the high density to the low 
density stratum. We explored the optimal allocation across strata, assuming that the 
optimal cluster sizes were chosen in each stratum (as shown in Table 5). Given a fixed 
cluster size in each stratum, $b_h$, we set the sampling fraction in the $h$-th stratum, $f_h$,
proportional to $\sqrt{(\mathrm{Deff}_h S_h^2)/(C_{ha}/b_h)}$, where $\mathrm{Deff}_h$ is the design
effect for the statistic in the $h$-th stratum, $S_h^2$ is the element variance in the $h$-th stratum,
$C_{ha}$ is the sampling and screening cost for PSUs in the $h$-th stratum, and $b_h$ is the number
of sample elements per cluster in the $h$-th stratum.
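
The allocation rule can be illustrated with a short calculation. This is a sketch under stated assumptions: all input values below are hypothetical, the function name is ours, and the ratios are scaled so that the last (low density) stratum equals 1, as in the High:Med:Low ratios quoted in the text.

```python
import math


def allocation_ratios(deff, element_var, cluster_cost, cluster_size):
    """Relative sampling fractions, one per stratum.

    Each fraction is proportional to
        sqrt(Deff_h * S_h^2 / (C_ha / b_h)),
    and the result is scaled so the last stratum has ratio 1.
    """
    raw = [
        math.sqrt(d * s2 / (c / b))
        for d, s2, c, b in zip(deff, element_var, cluster_cost, cluster_size)
    ]
    return [r / raw[-1] for r in raw]


# Hypothetical inputs for the high, medium and low density strata.
example_deff = [1.1, 1.1, 1.1]          # design effects (assumed)
example_var = [0.25, 0.25, 0.25]        # equal element variances (assumed)
example_cost = [50.0, 100.0, 200.0]     # per-PSU screening costs (assumed)
example_size = [10, 10, 10]             # elements per cluster (assumed)

ratios = allocation_ratios(example_deff, example_var, example_cost, example_size)
print(":".join(f"{r:.1f}" for r in ratios))
```

With equal element variances, the strata with cheaper screening receive the larger sampling fractions, which is the direction of oversampling discussed in the text.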

Table 6 presents optimal ratios of sampling fractions for various combinations of element
variances in the three strata and the various ρ values. The table shows that the optimal
allocations across strata are relatively insensitive to changes in ρ values (for the range of
ρ values that are likely given this design). If the strata with higher densities of Black households
have element variances at least equal to that of the low density stratum, an oversampling 
of those strata is desirable. (This reflects the much lower costs in those strata.) The 3:2:1 
ratio of sampling fractions is best when the ratio of strata standard deviations is about 
1.7:1.5:1. An examination of the data obtained from the survey suggests that many variables 
have ratios of standard deviations across the three strata close to 1:1:1. For such variables 
the optimal ratio of sampling fractions is 1.7:1.4:1, given the optimal cluster sizes shown 
in Table 5. (With the cluster size of 6.5 actually used in each stratum, the optimal fractions 
have the ratio 2.5:1.6:1.) Both these ratios of sampling fractions suggest that the oversampling 
actually used in the production study created a loss of precision per unit cost, relative to 
that corresponding to the optimal sampling fractions. 

Table 6


Optimal Allocation of the Sample Across Strata for Overall Means, Given 
Optimal Cluster Sizes in Each Stratum, for Various Relative Standard 
Deviations Across Strata and Values of Intracluster Correlations 


Ratios of Within Stratum 
Standard Deviations 
(High:Med:Low) 


Ratios of Optimal Sampling Fractions 
(High:Med:Low) 


ρ = .005


6. SUMMARY 


Rare population sampling forces the survey statistician to consider combinations of PSU 
and cluster definitions, stratification, and alterations of measures of size which are not typically 
found in cross-section samples. This research found that these traditional sample design techni- 
ques can be adapted to increase the efficiency of two-stage telephone samples for the Black 
household population with telephones. 

First, this research found that even the rough correspondence between telephone exchanges 
and large cities and states permitted stratification that successfully discriminated exchange 
groups with vastly different eligibility rates. The high density stratum had over twice the 
proportion of Black households as did the low density stratum. This permits control over 
screening costs in sample implementation. With other rare populations which are residen- 
tially segregated, similar results are expected. 

Second, the use of rejection rules based on subpopulation eligibility effectively reduced
screening costs within PSUs. It increased the eligible proportion of secondary numbers
by a factor of two to nine, depending on the density stratum considered.

Third, use of a larger PSU (200- versus 100-series of consecutive numbers) produced no 
serious loss of eligibility. Hundred series densely filled with eligible numbers tend to be adja- 
cent to others densely filled. This is a discovery concerning the practice of assigning numbers 
by telephone companies. This fact permits larger numbers of sample numbers per PSU, 
another key feature in reducing the costs of the Black population sample. 

Despite great pressures for cost reduction in rare population samples, it is important to 
balance errors and costs explicitly in choosing the final design. In this research such cost 
and sampling error modeling suggested that disproportionate allocation of the sample to Black-
density strata is desirable. In addition, it is most efficient to select a relatively large set of 
secondary numbers per PSU. This set is sufficiently large that the 200- or 400-series PSU 
definition must be used. 

Although we have applied this design only to the Black population, its performance should
be similar for other residentially segregated populations. These include income groups, certain
occupational groups, and ethnic groups.

In addition, the discoveries of this research may also have implications for cross-section 
samples. Increasing the PSU size from 100 to 200 consecutive numbers may be advantageous 
in a two-stage RDD design for sampling the general telephone household population. The 
larger 200-series would provide twice as many numbers to select from and, as with the rare 
population, the proportion of eligible numbers would tend to be similar to that found in 
the 100-series. Therefore, given low intracluster correlation values, the cluster size of eligible 
numbers for a design could be set much closer to the optimal size. Because all PSUs selected 
would be able to support the chosen number of sample numbers, the achieved cluster size 
of eligible numbers should also be less variable over PSUs and therefore the impact of com- 
pensating weighting on the variance of estimates should not be great. 

ABSTRACT


This paper presents results from methodological experiments comparing telephone and face-to-face 
interviewing in surveys of the general population. The relatively low level of telephone ownership in 
the United Kingdom, especially among the less privileged, argues the need for a dual-mode approach 
combining telephone interviews with face-to-face interviews for those without telephones. This approach 
depends on the absence of differential mode-effects on the answers obtained or on the ability to account 
for these effects when they occur. 
quality.


1. INTRODUCTION 


The choice of a mode of data collection for a survey depends upon the availability of 
facts about the alternatives. In the U.K., such facts about telephone interviewing have just 
recently begun to emerge. The necessary comparisons between telephone interviewing and 
other data collection modes have been carried out only in the last two years. This delay is 
surprising given the lively debate about the merits and drawbacks of telephone interviewing 
and the attention which the issue has received in other countries. 

Two studies conducted by the Survey Methods Centre at Social and Community Plan- 
ning Research comparing telephone and face-to-face interviewing provide the focus for this 
paper. Carried out in 1983 and 1984, these studies examine some of the central issues: the 
public’s willingness to take part in telephone surveys and the kind, quality and volume of 
data that can be collected. The studies are described in Section 2 and their results presented 
in Sections 3 and 4. Reference is also made to another British study - an experiment carried 
out in 1985 by the Market Research Development Fund - and to the larger volume of 
methodological research conducted in other countries, particularly the United States. 


2. THE SCPR STUDIES 


Our research program reflected telephone ownership which is low by North American stan- 
dards: about 75% of households possessed telephones in 1983. Non-coverage is substantial 
and crucial, for social researchers, because of its bias towards less affluent sectors of British 
society. In this context, the main objective was to evaluate dual-mode interviewing, where 
telephone owners would be interviewed by telephone, and non-owners face-to-face. 

The first study provided two comparisons towards this evaluation: between an experimental 
dual-mode sample and a larger national sample interviewed face-to-face; and between two 
samples of telephone owners, one sample interviewed by telephone, the other interviewed 
face-to-face. In this paper, we focus on the latter comparison, which addresses the question 
that lies at the heart of any evaluation of the dual-mode approach: are telephone and face-
to-face data compatible or are there modal differences between them? If there are modal 
differences, the data cannot be ‘‘added’’ together and treated as a single data set without 
the kind of adjustments not usually possible in a one-time survey. The second study concen- 
trated only on this direct comparison between the two interview methods among telephone 
owners. 


2.1 Study 1 


The first study was conducted alongside the 1983 British Social Attitudes Survey, which 
is here referred to as the ‘‘main’’ survey. This survey involved face-to-face interviews of about 
an hour, covering a wide range of political, economic, social and moral issues. 

The sample for the main survey was about 1,750, and was representative of adults aged 
18 or over living in private households. For practical reasons, the sample was confined to 
those at addresses in the Electoral Register. People living in institutions (though not private 
households at such institutions) were excluded, as were the 4% of adults known to live at 
addresses not on the Electoral Register (Todd and Butcher 1982). 

A multi-stage design was used with four stages of selection: 103 constituencies in England 
and Wales and 11 local authority districts in Scotland were selected with probability propor- 
tional to electorate; within each a single polling district was selected, again with probability 
proportional to electorate; from each polling district, 23 addresses were selected with pro- 
bability proportional to the number of electors registered at the address. At the final stage, 
one person at each address was selected by the interviewer, using an adaptation of the
Marchant-Blyth procedure (Blyth and Marchant 1973). 
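
Selection with probability proportional to electorate can be illustrated with a systematic draw on cumulated sizes. This is a minimal sketch, assuming systematic PPS selection (the text does not state which PPS method was used); the electorate figures and the function name are hypothetical.

```python
import random


def pps_systematic(sizes, n, rng):
    """Pick n units with probability proportional to size.

    Uses a random start and a fixed step along the cumulated sizes.
    Returns the indices of the selected units in list order.
    """
    total = sum(sizes)
    step = total / n
    start = rng.random() * step          # random start in [0, step)
    selected = []
    cumulated = 0.0
    idx = 0
    for k in range(n):
        point = start + k * step
        # Advance to the unit whose cumulated range contains the point.
        while idx < len(sizes) - 1 and cumulated + sizes[idx] <= point:
            cumulated += sizes[idx]
            idx += 1
        selected.append(idx)
    return selected


# Hypothetical electorates for six areas (assumed for illustration).
example_electorates = [60000, 45000, 80000, 52000, 70000, 48000]
chosen = pps_systematic(example_electorates, 3, random.Random(1987))
print("selected areas:", chosen)
```

Areas with larger electorates cover a longer stretch of the cumulated scale and so are more likely to contain a selection point. A unit larger than the step could be selected more than once; that case does not arise with the sizes assumed here.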

For the experiment, a parallel sample of about 800 addresses (seven per area) was selected 
from the same 114 sampling points. These addresses, together with all the names in the Elec- 
toral Register, were submitted to British Telecom’s telephone number-retrieval facility. The 
facility yielded telephone numbers for 65% of the submitted addresses. Most of the difference 
between this retrieval rate and the level of telephone ownership - around 75% at the time 
— can be explained by ex-directory numbers: about 12% of telephone numbers in Great Bri- 
tain are ex-directory, with regional and other variations as noted by Collins and Sykes (1987). 
Other problems in tracing telephone numbers seem to have had little effect. 
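
A rough check of this explanation, assuming one telephone number per owning household and that the 12% ex-directory rate applies uniformly to the sampled addresses (both simplifications of ours), gives an expected retrieval rate close to the observed 65%:

```latex
\[
0.75 \times (1 - 0.12) = 0.66 \approx 0.65
\]
```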

The following procedure was used by British Telecom for retrieving telephone numbers: 
once the correct telephone exchange area had been identified by the address, the subscriber’s 
name was looked up in the directory. Specific address details (i.e., the street name) helped 
distinguish between subscribers with identical names. Since it is not clear from the Electoral 
Register which of the names at an address is that of the subscriber, British Telecom was 
asked to check every name before abandoning a search. 

The telephone numbers obtained were systematically assigned to four sub-samples. Two 
of these were interviewed by telephone using a questionnaire expected to take about 20 minutes 
to complete. The questions were drawn from all sections of the main Social Attitudes ques- 
tionnaire. The other two sub-samples were interviewed by telephone using a longer ques- 
tionnaire - estimated at 40 minutes - that was also drawn from the main survey questionnaire. 
One sub-sample allocated to each of the 20-minute and 40-minute questionnaires was sent a
letter before the telephone calls. The other two sub-samples received no advance warning of the
survey. In all cases the selection of a respondent for interview was on the same basis as for
the main survey. 

Experimental sample addresses for which no telephone numbers could be obtained from 
British Telecom were given face-to-face 20-minute interviews. Combined with those obtained 
by telephone, these interviews formed a dual-mode survey that was compared with the main 
face-to-face interview survey (Sykes and Hoinville 1985). 

A more direct examination of interview mode effects was sought by submitting a systematic
sub-sample of 600 of the main sample addresses (five in each area) to British Telecom’s 
number-retrieval service. In this case, numbers were returned for 55% of the addresses (the 
variability of the success rate of the British Telecom number-retrieval service remains unex- 
plained). Comparisons were then made between those who were interviewed by telephone 
and those who could have been interviewed by telephone but were interviewed face-to-face. 
By restricting comparisons to the telephone-accessible population, we controlled for effects 
attributable to differences between the compared populations rather than to differences in 
the mode of data collection. 


2.2 Study 2 


The second experiment concentrated on this direct comparison. About 2,300 addresses 
were selected from the Electoral Register, as in Study 1, and were sent to British Telecom 
for telephone numbers (with in this case, a 61% retrieval rate). Addresses for which telephone 
numbers were retrieved were split into three sub-samples. One group was interviewed by 
telephone using ‘‘pencil and paper’’ methods; another was interviewed using Computer 
Assisted Telephone Interviewing (CATI); the third was interviewed face-to-face. Our experi-
ment with CATI was a practical failure (for a number of reasons), but the other two sub- 
samples again give us a direct comparison between people interviewed by telephone and peo- 
ple who could have been interviewed by telephone but were interviewed face-to-face. The 
questionnaire, designed to take 25 minutes, consisted of a sub-set of questions from the 1983 
British Social Attitudes Survey. 


2.3 Limitations on the Comparisons between Interviewing Modes 


Three factors could limit comparisons between the answers obtained face-to-face and those 
obtained over the telephone. First, differential non-response (as discussed in Section 3) could 
have led to differences in the composition of the respondent sets. This possibility was tested 
using a number of demographic and socio-economic variables believed to be associated with 
certain attitude variables. Significant differences between the respondent sets suggest that, 
quite apart from any differences between the modes in overall response levels, certain kinds 
of people are more likely to participate in a telephone rather than a face-to-face survey, and 
vice versa. The variables examined were: age within sex, marital status, household composi- 
tion, economic status, socio-economic group and geographical location. No statistically signifi- 
cant evidence of differential non-response was found in the first study. In the second study, 
two variables showed statistically significant differences between the telephone and face-to- 
face samples: household composition (the telephone respondents included a higher propor- 
tion of childless couples under 60, while the face-to-face sample had a higher percentage 
of couples with young children and teenagers); and socio-economic group (intermediate and 
junior non-manual workers and those in ‘‘other’’ occupations had greater representation 
in the telephone sample than face-to-face, and ‘‘homemakers’’ were a higher proportion of 
the face-to-face sample). These differences may well represent only sampling fluctuations, 
but they should lead to some caution in the interpretation of differences between the answers 
of the two samples. 

The second possibility is of different levels of skill or supervision between the telephone 
and face-to-face interviewers. Six telephone interviewers were employed on the first experimen- 
tal survey. Two were fully trained and experienced face-to-face interviewers, but the remainder 
had had no previous interviewing experience and so received basic interviewer training as 
well as the special telephone interviewing training that all six interviewers underwent. The 
second study involved 10 interviewers, three of whom had worked on the previous study. 
As in the previous study, a supervisor was present to listen in, advise on interviewing techni- 
que when necessary and check for obvious errors in completed questionnaires. 

The face-to-face interviewers for both studies were drawn from Social and Community
Planning Research’s panel of about 300 regularly employed face-to-face interviewers. Their 
training in basic interviewing techniques was similar to that given to the telephone interviewers. 
However, for the most part, the face-to-face interviewers were more experienced than 
their telephone counterparts. Differences between the two groups of interviewers should, 
therefore, be kept in mind, especially differences suggesting lower quality in the telephone 
interviews. 

The third factor is the questionnaires. The main Social Attitudes questionnaire, compris- 
ing about 100 questions, was divided into five broad topic areas: employment, education, 
health and housing, issues of social class, and racial and sexual equality. The experimental 
questionnaires were composed of those questions considered most important in the main 
survey. These questions were chosen to represent the full range of question types in the main 
questionnaire. 

As a result, the experimental questionnaires covered a range of topics (including some 
‘‘sensitive’’ issues) and included questions involving different kinds of response tasks and
levels of complexity. The order of the questions on the Social Attitudes Survey was main- 
tained for both the 20-minute and 40-minute experimental questionnaires used in the first 
study and for the 25-minute questionnaire used in the second study. Thus the 40-minute ques- 
tionnaire was not made up of the short questionnaire followed by a further 20 minutes of 
questions: rather, questions from the 20-minute version were spread throughout. Alterations 
to question wording were made only when unavoidable; for example, re-wording to adjust 
for the necessary absence of showcards. The Social Attitudes Survey questionnaire consists 
largely of closed questions, so few of the results from our experiments relate to open questions. 

All of these limitations should be kept in mind when examining our results, but they are 
largely inevitable in such comparative studies. As described above, we have tried to identify 
and minimize them. They are of great concern only when our results suggest mode effects 
that might confound the effects of other variables: most of our results do not point to this. 
Thus the limitations should be considered only as potential sources of effects counteracting 
mode effects we might otherwise have found - surely a less serious threat to the validity of 
our conclusions. 


3. RESPONSE RATES 


In the U.K., doubts about the feasibility of telephone interviewing, particularly for social 
surveys, stem from concerns not only with the level of communication possible, and its ef- 
fect on both cognitive and affective dimensions of the interview, but also with the general 
social acceptability of this use of the telephone. In Britain, it is a common belief among 
researchers that ‘‘cold calls’’ from strangers are likely to be treated with circumspection: 
a call from a telephone interviewer may be regarded as inappropriate and intrusive. 

A common counter argument points out the possible advantages telephone interviewing 
has over face-to-face interviewing, particularly in inner city areas. Escalating personal and 
property crime has led to increasing suspicion of strangers, which means falling response 
rates and the installation of devices such as entry-phones that make it harder for personal 
interviewers to contact respondents. By telephone, contact will almost certainly be made at an
address if someone is there, and, if not, subsequent attempts are not expensive. 

Table 1 shows the response rates for both studies conducted by the Survey Methods Centre. 

Table 1
SCPR Experiments: Response Rates

                                      Study 1                    Study 2
                               Telephone  Face-to-Face    Telephone  Face-to-Face
Bases                            (429)       (313)          (730)       (631)
                                   %           %              %           %
Completed interviews              53          60             46          68
Partial interviews                 1           -              -           -
Refusal (no selection)             5           2             21           6
Refusal (proxy)                    9           5              7           4
Refusal (selected person)         11          18             10          11
No contact (a)                     3           1              8           4
Selected person never in           3           3              3       [illegible]
Ill, away, language problems       2           5              2           4
Other (b)                         13           6              4           2

Study 1: X² = 3.72 d.o.f. = 1 0.05 < p < 0.1
Study 2: X² = 66.22 d.o.f. = 1 p < 0.001
Studies 1 and 2 combined: X² = 59.46 d.o.f. = 1 p < 0.005
(Comparisons with only two categories: completed interviews and all other outcomes.)

(a) Includes ‘‘Ring no answer’’ and ‘‘Permanently engaged’’.

(b) Includes ‘‘Broken appointments’’, ‘‘Too old’’, ‘‘Incapacitated’’, ‘‘No connection’’, ‘‘Right number, wrong address’’.


Response to these studies, for both the telephone and face-to-face components, was 
relatively low. (We would normally expect personal interview response rates of over 70% 
before reissue of refusals.) This owes something to the nature of the surveys - general pur- 
pose surveys are notoriously difficult to ‘‘sell’’ to respondents. The same argument can also 
be applied to the only other major British methodological comparison survey, carried out 
by Marplan on behalf of the Market Research Development Fund. This study used the same 
sampling method as our own experiments and also included a wide range of general ques- 
tions, under the title Lifestyle in the 1980’s. In this case, the response rates obtained were 
45% by telephone with a sample base of 1697 and 67% face-to-face with a sample base of 
1233 (Market Research Development Fund 1985). In both our studies, the response rate was 
lower for telephone interviews: barely half of the issued addresses yielded interviews. As Table 
1 shows, the difference was on the borderline of non-significance for Study 1 but was 
statistically significant for Study 2 and for Studies 1 and 2 in combination. 
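
The two-category tests reported with Table 1 can be approximated from the published bases and percentages. The sketch below is illustrative only: cell counts are reconstructed by rounding base × percentage, the Pearson statistic is computed without a continuity correction (an assumption, since the text does not say which form was used), and the function names are ours.

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def completed_counts(base, pct_completed):
    """Rebuild (completed, not completed) counts from a base and a rounded %."""
    done = round(base * pct_completed / 100)
    return done, base - done


# Bases and completion percentages as printed in Table 1.
tel1 = completed_counts(429, 53)
f2f1 = completed_counts(313, 60)
tel2 = completed_counts(730, 46)
f2f2 = completed_counts(631, 68)

chi2_study1 = chi2_2x2(*tel1, *f2f1)
chi2_study2 = chi2_2x2(*tel2, *f2f2)
chi2_combined = chi2_2x2(
    tel1[0] + tel2[0], tel1[1] + tel2[1],
    f2f1[0] + f2f2[0], f2f1[1] + f2f2[1],
)

print(f"Study 1:  {chi2_study1:.2f}")
print(f"Study 2:  {chi2_study2:.2f}")
print(f"Combined: {chi2_combined:.2f}")
```

Because the counts are rebuilt from rounded percentages, the results agree with the published statistics only approximately.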

The difference might be attributed to our relative lack of experience with telephone inter- 
viewing, but it is consistent with findings from other countries. For example, in the United 
States lower response rates - mostly arising from the higher incidence of refusals to cooperate — 
have been reported by a number of authors (e.g., Hochstim 1967; Henson, Roth and Can- 
nell 1977). The position is summarized by Groves and Kahn, who write: 


‘‘The response rate of national surveys remains at least five percentage points lower than
that expected in personal interview. This has been a rather stable comparison despite 
changes over time in training of interviewers, monitoring techniques, feedback procedures 
from monitors, and techniques of introducing the survey to the respondent.’’ (Groves and 
Kahn 1979; p. 219) 


These findings suggest that sociological and psychological explanations of resistance to 
the telephone approach may be more appropriate than explanations of interviewer and general 
methodological inexperience. However, the first SCPR study appears to have been rather
more successful than either the second or the MRDF study. It has been suggested that this 
difference was due to the interest and excitement surrounding the first experiment. This may 
have communicated itself to the interviewers (for example, researchers were continually ‘‘drop- 
ping in’’ to observe the proceedings), thus affecting their success rates. Certainly, experience 
with face-to-face surveys suggests that interviewer morale and energy are important for good 
response rates. 

In the SCPR studies two survey conditions were varied to assess their impact on telephone 
response rates. For the first survey, half the telephone respondents were asked to do 20-minute 
interviews and the other half did 40-minute interviews (respondents were told the length of 
the interview towards the end of the introduction), and in both surveys advance letters giving 
notice of the interview were sent to a random half of the telephone sample. 

Table 2 shows that response for the 40-minute interview was lower than for the 20-minute 
interview, although the difference between the overall distributions was not significant. The 
main single reason for this lower response was the higher direct refusal rate, possibly in- 
dicating that respondents were less willing to undertake the longer interviews. However, very 
few respondents who had agreed to participate terminated an interview prematurely - even 
with the longer interview. 

Different strategies may be needed for longer questionnaires. While it may be reasonable 
to request respondents to take part in a 20-minute interview at the time when first contact 
is made, a system of appointments may be more successful where more interviewing time 
is required. Wiseman and McDonald (1979) suggest that refusal rates are likely to be lower 
when interviewers are instructed to make call-back appointments should the respondents 
indicate that they are busy. 

In other studies, sending advance letters to potential telephone respondents has been 
found to improve response rates. For example, Dillman, Gallegos and Frey (1976) obtained 
refusal rates which were, on average, 6% lower for respondents receiving advance letters 
(compared with 14%). As Table 3 shows, in the SCPR experiments response rates were 
slightly higher among respondents who had been sent an advance letter (no record was 
kept of whether letters had been received) although the differences were not statistically 
significant. 

To explore why respondents refuse to be interviewed by telephone, 55 refusers to the first 
study were followed-up to see whether they would have co-operated at the first contact if 
they had been approached personally. Forty said that the method of interview would have 
made no difference to their decision, and only a very small number of these people subse- 
quently agreed to be interviewed. Most of the rest said they would have taken part if they 
had been approached face-to-face and eventually completed a face-to-face interview (13 out
of 15).

Because face-to-face refusers were not followed up, we do not know if a proportion of 
this group would have preferred to be approached by telephone. 


3.1 Response Differences and Data Quality 


The public’s perception of the proper use of the household telephone may affect not only
response rates, but also the kinds of questions respondents will be prepared to answer. Of 
even greater concern, however, is the type of communication possible between interviewer 
and respondent and its potential effect on the measurements made. 

Face-to-face communication takes place both verbally and non-verbally, while the telephone 
has only limited channel capacity with exchanges between interviewer and respondent restricted 
to what is said and so-called paralinguistic cues: tone of voice, pauses and so on (Miller and 
Cannell 1982). 

Table 2
SCPR Experiments: Effects of Interview Length (Study 1)

                          40-Minute   20-Minute
Bases                       (206)       (223)
                              %           %
Completed interviews         48          59
Refusal                      27          23
Other                        25          18

X² = [illegible] d.o.f. = 2 0.10 > p > 0.05

Table 3
SCPR Experiments: Effects of Advance Letters on Response Rates

                              Study 1                Study 2
                          Letter   No Letter     Letter   No Letter
Bases                     (215)     (214)        (388)     (392)
                            %         %            %         %
Completed interviews       55        51           48        43
Refusal                    23        27           37        38
Other                      22        21           15        19

Study 1: X² = [illegible] d.o.f. = 2 p > 0.5
Study 2: X² = 2.8 d.o.f. = 2 p > 0.2
Studies 1 and 2 combined: X² = 3.49 d.o.f. = 2 p > 0.1


The possible implications for survey measurements of the telephone’s limited channel 
capacity are numerous. For example, the absence of visual aids may increase the difficulty 
of some response tasks. ‘‘Voice only’’ communication may not convey the full meaning behind
respondents’ words (making it difficult, for example, to probe open-ended questions) and 
may not reveal if they actually understand the questions. There may also be limitations on 
the interviewer’s ability to perform his or her role. Can verbal signals, for example, replace 
the non-verbal cues that convey interest and attention to the respondent, or those that help 
control the interview? Can the interviewer hold the concentration of the respondent, par- 
ticularly in long interviews? Conversely, is the absence of visual stimuli a desirable reduction 
in the many sources of variability in survey data? Finally, does the greater social distance 
in the telephone interview make the respondent more or less comfortable in revealing sen- 
sitive information such as income, or information with a strong social desirability component? 

SCPR’s experiments addressed some of these issues. 


3.1.1 General Comparisons 


Given the different refusal rates of the interviewing modes, it is surprising that there are 
few other general differences. This result has been replicated in many studies in the U.S. 
(Groves and Kahn 1979; Lucas and Adams 1977; Jordan et al. 1980; Colombotos 1969; 
Wiseman 1972), and in other countries such as Denmark (Kormendi et al. 1986). Simple,
straightforward questions asked identically by telephone and face-to-face yield similar
distributions of response. 
In the SCPR studies the marginal distributions of response yielded by the different modes 
of interview were compared and differences were tested for statistical significance using 
chi-squared tests. These tests were performed on unweighted data. However, tables in the text, 
squared tests. These tests were performed on unweighted data. However, tables in the text, 
unless otherwise indicated, show distributions of data weighted to take account of any dif- 
ferences between the number of people listed on the Electoral Register and those found at 
an address. Such differences occurred in approximately 25% of cases, in each of which the 
data were weighted by the number of persons aged 18 or over living at the address divided 
by the number of electors listed on the Register for that address. Weighted tables are given 
to allow readers to decide if they might draw different conclusions from telephone survey 
data and face-to-face survey data when both sets have been prepared according to routine 
procedures. 
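
The weighting rule just described can be sketched as follows. This is a minimal illustration only, not the SCPR processing code; the function names and the four sample addresses are hypothetical.

```python
def address_weight(adults_at_address: int, electors_listed: int) -> float:
    """Weight = persons aged 18 or over at the address / electors listed on the Register."""
    if electors_listed <= 0:
        # An address drawn from the Electoral Register lists at least one elector.
        raise ValueError("electors_listed must be positive")
    return adults_at_address / electors_listed


def weighted_percentages(responses):
    """responses: iterable of (answer, weight) pairs; returns {answer: weighted %}."""
    totals = {}
    for answer, weight in responses:
        totals[answer] = totals.get(answer, 0.0) + weight
    grand_total = sum(totals.values())
    return {answer: 100.0 * w / grand_total for answer, w in totals.items()}


# Hypothetical interviews: (answer, adults found at address, electors listed)
sample = [("agree", 2, 2), ("agree", 3, 2), ("disagree", 1, 1), ("disagree", 1, 2)]
weighted = weighted_percentages(
    (answer, address_weight(adults, electors)) for answer, adults, electors in sample
)
```

Addresses where the Register and the household agree receive a weight of 1, so the weighting only alters the contribution of the minority of cases in which the two counts differ.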

Standard chi-squared tests were performed even though the data arose from a multi-stage 
sample. It has been shown (see, for example, Holt, Scott and Ewings 1980) that 
underestimating true variability by ignoring sample design will generally lead to test statistics 
which are too large, and hence to the false rejection of null hypotheses (i.e., to anti- 
conservative tests). For the Social Attitudes Survey, however, estimation of true standard 
errors for attitudinal variables yields Design Factors (the ratio of the complex standard error 
to the simple random sampling standard error) which are rarely above 1.2 (Jowell and Withers- 
poon 1985). Further, the literature argues that in 2-way tests of independence the consequences 
of clustering are likely to be less severe (Holt, Scott and Ewings 1980). As a result, we feel 
justified in using standard chi-squared tests to avoid the large amount of computation 
necessary for corrected statistics. If anything, this approach will overstate the significance 
of differences between interview modes. 
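
As an illustration of the standard test discussed above, the sketch below computes a Pearson chi-squared statistic for a two-way table. The counts are approximate values implied by the Study 2 percentages and bases in Table 3, reconstructed here for illustration rather than taken from the authors' raw data. The final step, dividing the statistic by the square of a Design Factor of 1.2, is a crude first-order adjustment added as an assumption; it is not a correction the authors applied.

```python
import math


def pearson_chi2(table):
    """Pearson X^2 for a two-way table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            x2 += (observed - expected) ** 2 / expected
    return x2


def p_value_df2(x2):
    """Upper-tail probability for chi-squared with exactly 2 d.o.f., which is exp(-x/2)."""
    return math.exp(-x2 / 2.0)


# Rows: letter, no letter. Columns: completed, refusal, other (approximate counts).
counts = [[186, 144, 58], [169, 149, 74]]
x2 = pearson_chi2(counts)
p_plain = p_value_df2(x2)

# Assumed adjustment: deflate X^2 by deft^2, using the upper Design Factor quoted in the text.
deft = 1.2
x2_adjusted = x2 / deft ** 2
p_adjusted = p_value_df2(x2_adjusted)
```

Because the adjusted statistic is always smaller than the unadjusted one, any difference that is non-significant under the standard test remains non-significant after such an adjustment, which is the sense in which the standard test can only overstate significance.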

In the first study we looked at 95 questions and parts of questions and in the second study 
69. The results are shown in Table 4. It is clear that in both studies the results accorded with 
those of other researchers: the interviewing modes yield significantly different distributions 
of answers for only a very small percentage of questions. A similar finding emerged from 
the MRDF study. 


3.1.2 Comparisons for Particular Question Forms 


Despite the general result, research in the U.S. has shown that there are specific kinds 
of questions for which differences in response distributions do occur. For example, Groves 
and Kahn (1979) demonstrated a tendency for respondents to give truncated answers to open- 
ended items over the telephone. This might be due to the faster pace of telephone interview- 
ing, as noted, for example, by Dillman (1970) and Williams (1977). Both interviewers and 
respondents tend to speak more quickly on the telephone and to avoid silent pauses. The 
swifter pace of telephone interviews was shown in our second experiment. As Table 5 shows, 
with an interview designed to take 25 minutes, 10% of the telephone interviews were con- 
ducted in under 20 minutes, compared with 5% of face-to-face interviews. At the other 
extreme, 41% of face-to-face interviews took more than half an hour compared with under 
a third of the telephone interviews. 

Ball (1980) suggests that the greater speed may occur because the norms of telephone con- 
versations require both the interviewer and respondent to work to maintain the conversa- 
tional flow. This may leave respondents with less time to think about their answers. Certainly, 
silences seem to make people uncomfortable - in a study by Jordan (1980) routine pauses 
in the interview were described as interminable by interviewers. Undoubtedly there are many 
other contributing factors: even the absence of visual distractions may be important. 

Although SCPR’s experimental studies did not carry any open-ended items, the MRDF 
study included a number of spontaneous awareness measures. Comparisons of telephone 
and face-to-face results appear consistent with the findings discussed above. One example 
is given in Table 6, which shows that a third of telephone respondents gave no answers, 
compared with under a quarter face-to-face. Also, the average number of responses given over 
the telephone was significantly lower. 


Table 4 
Differences in Marginal Distributions of Response: 
Telephone vs. Face-to-Face 

                               Study 1        Study 2 
Bases                           (95)           (69) 
                                 %              % 
No significant difference    [illegible]        87 
Significant at 5%            [illegible]    [illegible] 
Significant at 1%                2              4 


Table 5 
Interview Length by Mode of Interview (Study 2) 

                      Telephone    Face-to-Face 
Unweighted Bases        (354)         (360) 
                          %             % 
Minutes 
  Under 20                10             5 
  20-29                   63            53 
  30-40                   22            33 
  40+                      6             8 

X² = [illegible] d.o.f. = 3 p = 0.01 


Table 6 
Comparisons of Responses on an Open Question (MRDF Survey) 

What do you like about ... soup? 

                      Telephone    Face-to-Face 
Bases                   (700)         (601) 
                          %             % 
Number of answers 
  None                    33            22 
  One                     58            61 
  Two                [illegible]        14 
  Three or more            1             2 
Average              [illegible]       0.96 

X² = 32.2 d.o.f. = 3 p < 0.001 

We might assume that more or longer answers mean more valid reporting, and this would 
imply a need for techniques to improve open questions on telephone surveys. At the extreme, 
it might be concluded that open questions have only limited use on telephone surveys, for 
example when only the first information spontaneously offered by respondents is wanted. 
This assumption needs, however, to be tested: here we can only report the effect. 

Differences between response distributions have also been reported for attitude scale ques- 
tions asked identically face-to-face and over the telephone. Telephone respondents tend 
towards ‘‘acquiescence’’ and ‘‘extremeness’’ response bias (Jordan, Marcus and Reeder 1980; 
Groves and Kahn 1979). With the agree/disagree scales used by MRDF, the telephone sam- 
ple showed a slight tendency to agree more. However, no difference in the spread of responses 
was found - there was no evidence of a greater tendency towards extremeness. 


3.1.3 Sensitive Questions 


Concerning the types of question that can be used in telephone surveys, researchers have 
paid much attention to sensitive questions - those that deal with private or personal infor- 
mation and those for which certain responses are more clearly socially acceptable. Initial 
views about the likely effects of asking sensitive questions over the telephone were divided. 
Those who felt that respondents would be less willing to answer truthfully said that the lack 
of the interviewers’ reassuring presence would make respondents less likely to be frank and 
open. The opposite view - that respondents would give more valid answers - maintained 
that greater social distance, by preserving anonymity, would encourage truthful responses. 

Most evidence supports the latter view (Colombotos 1965; Wiseman 1972; Henson, Roth 
and Cannell 1974; Locander 1974; Rogers 1976). The major exception is reported by Groves 
and Kahn (1979), who found telephone respondents to be reticent about their financial status 
and other sensitive issues. 

Our studies support the hypothesis that telephone surveys work well for sensitive ques- 
tions. For instance, in our first study 14 questions were isolated as potentially sensitive and 
tested for mode-effects. Three illustrative examples of such questions are given below: 


i) How would you describe yourself?: 
(Read out) ... 
... as very prejudiced against people of other races 
... a little prejudiced 
... or, not at all prejudiced? 

ii) Do you think, on the whole, that Britain gives too little or too much help to Asians 
and West Indians who have settled in this country, or are present arrangements about 
right? 

iii) Finally in this section, I would like you to tell me whether, in your opinion, it is accep- 
table for a homosexual person to be a teacher in a school? 


No significant differences in the marginal distributions of response were found. For several 
questions, however, there was a somewhat greater tendency to give socially desirable answers 
in face-to-face contact. In other words, the questions seemed to be less sensitive over the 
telephone. For example, 28% of respondents interviewed by telephone admitted to having 
been questioned by police over the past two years in connection with a crime, compared with 
20% of face-to-face respondents. 

Sensitive questions in the MRDF study also showed a slight tendency for telephone 
respondents to give more ‘‘honest’’ answers, although on individual questions differences 
in the distributions were generally not significant. For example, when asked to describe 
themselves on a number of dimensions, telephone respondents were more likely to say 
they were ‘‘attractive’’ (mean score of 2.81 out of 4 compared with 2.72 face-to-face) 
and were more ready to give an answer at all (88% gave an answer compared with 75% face- 
to-face). 

Questions about income have generally been regarded as potentially problematic in 
telephone surveys, both in respondents’ willingness to answer and in the answers given. Under- 
reporting of income levels is the main expectation, although in practice this may be hard 
to distinguish from under-estimation resulting from higher non-response in the upper income 
brackets. A study by Locander and Burton (1976) suggests that the validity of income data 
may depend on the question format. In a comparison of four question formats, 
under-reporting of income resulted from a method that first asked ‘‘Is your income more than 
$2,000?’’, gradually increasing the figure until the first ‘‘no’’ response. However, 
over-reporting of income was encouraged by a similar method that began with the highest income 
category. The method used for the telephone surveys in the SCPR experiments was similar 
to the first type described above. It most closely approximates the response task set by the 
face-to-face income question in which a card indicating broad income bands, starting with 
the lowest, was used to guide the respondents’ choice. Over the telephone, the ranges were 
read to respondents starting at the lowest levels. The results are shown in Table 7. 
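
The two question sequences described above can be sketched as a small routine. This is a hedged illustration: the function names are hypothetical, the thresholds are borrowed from the income bands in Table 7 purely for convenience, and the descending variant is an assumption about how a method ‘‘beginning with the highest income category’’ might proceed.

```python
def ascending_unfolding(is_more_than, thresholds):
    """Ask 'Is your income more than T?' for increasing T until the first 'no'.

    Returns the index of the income band (0 = lowest band).
    """
    for band, threshold in enumerate(sorted(thresholds)):
        if not is_more_than(threshold):
            return band
    return len(thresholds)


def descending_unfolding(is_more_than, thresholds):
    """Assumed mirror image: start at the highest T and step down until the first 'yes'."""
    for steps, threshold in enumerate(sorted(thresholds, reverse=True)):
        if is_more_than(threshold):
            return len(thresholds) - steps
    return 0


thresholds = [5000, 10000]           # illustrative band edges only
true_income = 7500                   # hypothetical respondent
truthful = lambda t: true_income > t

band_up = ascending_unfolding(truthful, thresholds)
band_down = descending_unfolding(truthful, thresholds)
```

For a respondent who answers every step truthfully, both sequences lead to the same band. The under- and over-reporting found by Locander and Burton would therefore seem to arise from how respondents react to the order of the questions rather than from the mechanics of the procedure.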

In neither study was there any mode difference in respondents’ willingness to answer the 
income question. Differences in the distribution of answers, in this case a possible under- 
reporting of income, were only apparent in the first study. 


3.1.4 Complex Questions 


In both SCPR studies a number of questions were identified in advance as likely to pose 
particular response problems for telephone respondents. These included questions with one 
or more potentially difficult concepts, long questions and questions with large numbers of 
response options. Such ‘‘complex’’ questions appear to be no more problematic for telephone 
respondents than for those interviewed in person. For example, of 19 ‘‘complex’’ questions 
identified on the first study (12 of which had been asked with the aid of show-cards 
face-to-face), only one showed any evidence of mode-effects. 


Table 7 
Gross Household Income: SCPR Studies 

                             Study 1                     Study 2 
                      Telephone  Face-to-Face     Telephone  Face-to-Face 
Bases (a)               (183)       (170)           (297)       (352) 
Income 
  less than £5,000        38          27              28          28 
  £5,000-£9,999           42          37              37          38 
  £10,000 or over         21          35              35          35 

Study 1: X² = 10.08 d.o.f. = 2 p < 0.01 
Study 2: X² = [illegible] d.o.f. = 2 p > 0.9 

Bases                   (217)       (199)           (344)       (405) 
Don’t know/ 
Not answered             16%         15%             14%         13% 

(a) ‘‘Don’t know’’ and ‘‘Not answered’’ excluded. 


4. SUMMARY AND CONCLUSIONS 


Since telephone ownership in the United Kingdom remains relatively low, particularly for 
certain sectors of the population, telephone interviewing is unlikely to replace face-to-face 
interviewing for surveys that must include the less advantaged. But its potential in combina- 
tion with traditional face-to-face procedures has gained recognition. For example, the U.K. 
Labour Force Survey uses telephone interviewing for second and subsequent interviews with 
eligible respondents who have indicated a willingness to be contacted by telephone. 

Crucial to the success of dual-mode surveys is the absence of differential mode effects. 
The results reported here provide a largely optimistic outlook. With a few exceptions there 
were no statistically significant differences between the distributions of answers obtained face- 
to-face and those given over the telephone. 

However, the relatively low response rates to telephone surveys pose problems that need 
to be overcome. High refusal rates can reduce the cost-effectiveness of using the telephone. 
More importantly, they increase the chances of introducing bias into the sample. Further 
research to explore ways of improving telephone response rates is necessary to realize the 
potential of the method in the United Kingdom. 
Issues in the Use of Administrative Records for 
Statistical Purposes 


G.J. BRACKSTONE¹ 


ABSTRACT 


Demands for statistics on all aspects of our lives, our society and our economy continue to grow. At 
the same time statistical agencies share with many respondents a growing concern over the mounting 
burden of response to surveys. One result of the search for alternative methods of satisfying statistical 
demands has been an increased emphasis on the use of administrative records for statistical purposes. 
This paper reviews recent experience at Statistics Canada in this area and discusses obstacles to the 
greater use of administrative records. Approaches to rendering administrative systems more useful for 
statistical purposes are reviewed, together with some important concerns related to information pro- 
tection and record linkage. 


KEY WORDS: Indirect estimation; Survey frames; Survey evaluation; Access; Confidentiality. 


1. INTRODUCTION 


Demands for statistics on many aspects of our lives, our society, our economy and our 
environment continue to grow. This may be due in part to our increased ability to handle 
and manipulate large sets of data as we move into the so-called information age, and it may 
also be a reflection of the increasing complexity of our social and economic systems and 
our desire to understand them better. Whatever their cause we face these demands in a climate 
of tight budgetary constraint for government statistical agencies. At the same time, statistical 
agencies are sensitive to the increased burden that would be imposed on respondents by an 
increase in survey-taking activity to meet these demands. 

These factors have led to the exploration of other means of satisfying these statistical 
demands. Prominent among these alternative means is the increased use of existing ad- 
ministrative systems as sources of statistical data. This is not a new idea. For many years, 
statistical data have been a by-product of administrative processes in domains such as vital 
statistics, imports and exports, health care, and education. We will describe later how this 
usage of administrative data has spread more recently to statistics on businesses and on families 
and individuals. 

The first sections of the paper describe the variety of types and uses of administrative 
records, illustrating some of their uses in Statistics Canada’s program. The heavy depen- 
dency of Canada’s statistical system on administrative records will be apparent. Section 6 
discusses issues of accessing administrative sources and making them more appropriate for 
statistical use. Finally, a brief review of privacy concerns related to administrative record 
use is provided. 


2. TYPES OF ADMINISTRATIVE RECORD 


Administrative records come in many shapes and sizes. An important distinction is be- 
tween those administered nationally (usually by the Federal Government) and those 
administered sub-nationally (e.g., by provinces or municipalities). For the latter to be useful 
nationally, agreement between jurisdictions is required on items such as definitions, standards, 
record formats, and procedures. Such agreement is not always easy to achieve, particularly 
in domains that are constitutionally within provincial jurisdictions. 

Administrative records vary in terms of their purpose, and their purpose is a prime deter- 
minant of their coverage and quality, and therefore of their statistical usefulness. Six broad 
categories of purpose can be distinguished. 


(1) Records maintained to regulate the flow of goods and people across borders. 

These include records of imports, exports, immigration and emigration. The coverage 
and content of the resulting administrative records depend on the particular laws and 
regulations to be enforced, and on the success of their enforcement. Typically such 
laws are well enforced. Immigration records, by definition, exclude illegal immigrants 
but otherwise are complete. However, since emigration from Canada is not controlled, 
no direct administrative emigration records exist. Administrative records on Canadian 
imports tend to be more accurate than those on exports since the former require 
more detailed documentation in order to assess their liability for duty. 


(2) Records resulting from legal requirements to register particular events. 

Examples include births, deaths, marriages, divorces, business incorporations or 
amalgamations, licensing, etc. Typically coverage and quality of records collected for 
this purpose are very high in Canada, since evidence of this type of registration is 
necessary to obtain rights or benefits. 


(3) Records needed to administer benefits or obligations. 

Examples include taxation, unemployment insurance, pensions, health insurance, and 
family allowances. The coverage and content of these records are highly program 
dependent. The population to which they apply may be very well covered, but for political 
or administrative reasons the definition of this population may not be the most useful 
definition analytically. 


(4) Records needed to administer public institutions. 

These include, for example, records related to schools, universities, health institutions, 
courts and prisons. Such records tend to focus on the institutional caseload rather 
than on the individuals passing through the institution. On the other hand, they usually 
provide very complete aggregate statistics on the population using these institutions. 
In Canada, many administrative records in this category fall within provincial 
jurisdiction. 


(5) Records arising from the government regulation of industry. 

Examples include records in the areas of transportation, banking, broadcasting and 
telecommunications. They also include records arising from the management of the 
supply or the price of some commodities, especially in the agriculture area. 


(6) Records arising from the provision of utilities. 

These include electricity, phone and water services. Their coverage of subscribers and 
the quality of information associated with services and billing are normally good. Many 
of these services are administered at the provincial or municipal levels. 

Administrative records also vary in terms of the processes by which they are assembled. 
Most administrative processes with wide coverage are now automated, but differences in hard- 
ware and data formats (both between jurisdictions, and between the administrative agency 
and the statistical agency) have to be faced. Increased automation also leads to an increasing 
amount of modification to the originally reported records by the administrative agency before 
they are received by the statistical agency. While enhanced control of the quality of incom- 
ing forms may be beneficial to the final quality of the administrative file, additional work 
is required by the statistical agency to understand and evaluate the effects of any preliminary 
processing carried out by the administrative agency. In some administrative systems, the 
individual records remain at their local source and only aggregates are assembled centrally. 
This practice restricts the statistical agency’s ability to evaluate the quality of the data and 
limits flexibility in statistical analysis of the data. 

Finally, records differ in terms of their accessibility. Legal and regulatory provisions often 
govern access to, and use of, administrative records for secondary, including statistical, pur- 
poses. This topic is addressed further in Section 6. 


3. USES OF ADMINISTRATIVE RECORDS 


The statistical uses of administrative records may be categorized into four main areas. 
Most statistical applications of administrative records fall into one of these four categories 
or represent combinations or variations of these uses. 


(1) Direct Tabulation 


This includes the counting of units in files, cross-classification by attribute, and the 
aggregation of quantitative variables associated with each unit. Statistics on vital events 
and on external trade are important examples. Other examples include the publica- 
tion of monthly counts of unemployment insurance claimants, and of beneficiaries 
by province, age, sex and length and type of benefit, and annual summaries of income 
distributions for each county based on the personal income tax file. 


(2) Indirect Estimation 


This category includes cases where data from administrative records comprise one of 
the inputs into an estimation process. For example, individual tax returns for the same 
taxfiler are linked from one year to the next in order to produce partial estimates of 
migration which can be weighted up with reference to census-based benchmarks. These 
estimates of migration then feed into Statistics Canada’s population estimation pro- 
gram (which also makes use of administrative data on births, deaths and immigra- 
tion). A second example is the use of taxation data for small businesses in lieu of seeking 
survey data from them. These tax-based data, adjusted if necessary, are combined 
with survey-based data for large businesses to provide industry aggregates. 


Also within this category are uses that involve the linkage of different administrative 
or statistical files to produce estimates. For example, the linkage of the death register 
with files of individuals exposed to particular hazards in order to estimate differential 
mortality rates, or the linkage of records from tax files, unemployment insurance files, 
and manpower training files in order to analyse labour market attachment and 
adjustment. 
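
The year-to-year linkage of tax returns described in the first example can be sketched roughly as below. This is only an illustration under stated assumptions: the record layout, the taxfiler identifiers, the province codes and the coverage ratios are all hypothetical, and the simple inflation by 1/coverage stands in for whatever census-based benchmarking Statistics Canada actually uses.

```python
from collections import Counter


def migration_flows(year1, year2):
    """year1, year2: dict mapping taxfiler id -> province of residence.

    Returns a Counter of (origin, destination) pairs for filers matched in
    both years whose province changed.
    """
    flows = Counter()
    for filer_id, origin in year1.items():
        destination = year2.get(filer_id)
        if destination is not None and destination != origin:
            flows[(origin, destination)] += 1
    return flows


def weight_up(flows, coverage_by_origin):
    """Inflate partial counts by 1/coverage (assumed share of the benchmark
    population represented by matched taxfilers in the origin province)."""
    return {pair: count / coverage_by_origin[pair[0]] for pair, count in flows.items()}


# Hypothetical filers
year1 = {"A1": "ON", "A2": "ON", "A3": "QC", "A4": "BC"}
year2 = {"A1": "ON", "A2": "AB", "A3": "ON", "A5": "BC"}

partial = migration_flows(year1, year2)
estimates = weight_up(partial, {"ON": 0.8, "QC": 0.5, "BC": 0.75})
```

Filers present in only one of the two years (A4 and A5 here) contribute nothing to the partial counts, which is why the linked estimates have to be weighted up to a benchmark.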


(3) Survey Frames 


In this category we include the use of administrative records to create, supplement 
or update frames to be used for censuses or surveys. A primary example is the use 
of payroll deduction information submitted by employers to Revenue Canada. The 
questionnaire which has to be completed by new payroll deduction account holders 
is a valuable means of identifying new businesses or changes in the structure of existing 
ones. Although in Canada we do not have a register of housing units, a second example 
would be the use of building permits or new telephone or electricity connections as 
signals of possible new housing units. 


(4) Survey Evaluation 


This category covers the use of administrative records for checking, validating or 
evaluating survey-derived data. This may be done either at the individual unit level, 
or at an aggregate level. Several census evaluation studies in the past have used 
immigration and taxation records to evaluate census questions on immigration and 
income, respectively, while family allowance records have been used in checking the 
census coverage of children. 


An important determinant of how a particular administrative source will be used is the 
perceived quality of the administrative records compared to corresponding survey informa- 
tion. In some instances administrative records are used to evaluate survey responses, while 
in others survey-based data provide a means of benchmarking administrative-based estimates. 
The quality of administrative records has to be assessed in each individual case. In general, 
their quality for statistical purposes depends upon at least three factors: 


(i) the definitions used within the administrative system; 
(ii) the intended coverage of the administrative system; 
(iii) the quality with which data are reported and processed in the administrative system. 


Weaknesses in any of these three factors can affect the statistical usefulness of the 
administrative records. The timeliness with which they are available is also an important con- 
sideration. Some of the potential limitations that need to be considered when deciding on 
the statistical use of administrative records have been described elsewhere (e.g., see Brackstone 
1984). The strengths and weaknesses of administrative records compared to those of cen- 
suses and surveys are summarized in Table 1. 

To illustrate the utilization of administrative records in Canada we will describe two areas 
of application within Statistics Canada. The first deals with the production of business 
statistics; the second addresses the production of statistics on individuals and families. 


Table 1 
Comparison of Censuses, Surveys and Administrative Records as Sources of Statistical Data 

Coverage 
   Censuses: Aim at complete coverage of the population 
   Surveys: Some surveys exclude certain sectors of the population (e.g., Indian 
      reserves, remote areas) 
   Administrative records: Target populations are defined by administrative requirements 

Content 
   Censuses: Wide range of data items allows extensive cross-classification 
   Surveys: Usually covers a narrow range of topics but in more depth than a census 
   Administrative records: Restricted to variables required for administrative purposes 

Concepts/definitions 
   Censuses: Can be based on the requirements of social and economic analysis 
   Surveys: Can be based on the requirements of social and economic analysis 
   Administrative records: Defined by administrative requirements 

Small area estimates 
   Censuses: Available as a result of aim at complete coverage 
   Surveys: Unavailable in most cases 
   Administrative records: Available, provided individual records are geographically 
      coded to small areas 

Quality control 
   Censuses: Can be designed to minimize errors 
   Surveys: Smaller size allows for even tighter control than in censuses 
   Administrative records: Under the control of the administrative agency and may not 
      receive attention except for key variables 

Cost 
   Censuses: Expensive 
   Surveys: Relatively low cost per survey, although the cumulative cost of a regular 
      survey over a 5-year inter-censal period may be large 
   Administrative records: Relatively inexpensive if initial collection costs attributed 
      to the administrative program 

Frequency 
   Censuses: Every 5 or 10 years (depending on topic) 
   Surveys: May be annual, quarterly or monthly depending on topic 
   Administrative records: May be annual or monthly depending on administrative program 

Timeliness 
   Censuses: Data available six months to 2½ years after Census Day 
   Surveys: Repeated regular surveys produce results in a few weeks. Ad hoc surveys 
      may require several months 
   Administrative records: Dependent upon the administrative process. An annual file 
      may not be available in a clean form until well into the following year 

Stability 
   Censuses: Changes are under the control of statisticians who respond to user needs 
   Surveys: In repeated surveys, changes are infrequent to allow comparisons over time 
   Administrative records: Changes may occur due to legislative or regulatory change, 
      or due to changes in administrative practice 

Respondent burden 
   Censuses: Heavy but infrequent. Reduced through the use of sampling 
   Surveys: Light on average, though heavy for those selected 
   Administrative records: No additional burden 


4. ADMINISTRATIVE DATA AND BUSINESS SURVEYS 


Statistics Canada is currently in the throes of a complete redesign of the infrastructure 
and strategy on which its business surveys program is based. In particular this involves the 
redesign of the business register (the frame for business surveys), the re-thinking of the role 
and use of tax data within the program, and the development of a consistent strategy for 
the design of both annual and sub-annual business surveys. This redesign was motivated by 
needs to: 


(a) overcome some noticeable data quality weaknesses in the current program; 
(b) better integrate data from different surveys; 
(c) minimize respondent burden by making maximum use of tax data; 
(d) reduce resources required for maintaining survey frames. 


A more detailed description of this project can be found in Colledge (1987). 

Income tax and payroll deduction data play a prominent role in the conduct of business 
surveys. Annual tax returns submitted by corporations (T2) and by individuals (T1) are 
available to Statistics Canada under the Statistics Act. The payroll deductions of income 
taxes by employers are also available. Statistics Canada makes use of these data from business 
for two distinct purposes: 


(i) maintenance of its frame of businesses; 
(ii) substituting income tax data for survey data. 


4.1 Frame Maintenance 


The maintenance of a frame of businesses is a complex task. This complexity stems 
primarily from the complex structure and inter-relationships of many businesses, particularly 
large ones, and from the difficulty of keeping track of the very large number of births and 
deaths occurring among small businesses. The term ‘‘business’’ itself needs careful definition. 
In fact a distinction must be made between legal structures (incorporated companies, etc.), 
operating structures (the way companies organize and operate themselves), and statistical 
structures (the units for which data are required for analytic purposes). A hierarchy of units 
can be defined within each of these structures. In the case of the statistical structure, Statistics 
Canada has defined a hierarchy comprising, from top down, enterprises, statistical companies, 
establishments and locations. The task of frame maintenance thus involves not only updating 
for births and deaths but also keeping track of changes in the relationships between the various 
units within complex businesses, including the relationships between the statistical and 
operating hierarchies. 

The proposed frame strategy calls for the continuous maintenance of the current corporate 
structure of all companies above a certain threshold size (which varies with industry), including 
the relationship of this structure to tax reporting units. Companies updated in this way will 
account for at least 70% of economic activity in each industry. 
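
The paper does not say how the industry thresholds are chosen. Purely as an illustration, one way to find a cut-off consistent with the 70% coverage figure is to rank the companies in an industry by size and accumulate activity from the largest downwards. The function name, the size measure and the sample values are hypothetical.

```python
def size_threshold(sizes, target_share=0.70):
    """Highest size cut-off such that companies at or above it account for at
    least target_share of the industry's total activity."""
    total = sum(sizes)
    running = 0.0
    for size in sorted(sizes, reverse=True):
        running += size
        if running >= target_share * total:
            return size
    raise ValueError("no companies supplied for this industry")


# Hypothetical activity measure (e.g. revenue) for the companies in one industry
industry_sizes = [500, 300, 100, 50, 30, 20]
cutoff = size_threshold(industry_sizes)
maintained = [s for s in industry_sizes if s >= cutoff]
coverage = sum(maintained) / sum(industry_sizes)
```

In this sketch only two of the six companies fall above the cut-off, yet they account for 80% of activity, which reflects the concentration that makes continuous maintenance of a small number of large units worthwhile.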

An activity known as ‘‘profiling’’ is used to determine the internal structure of complex 
businesses. This involves interviewing officers of the business to understand their operating 
structure and identify the appropriate statistical units. An important source of information 
on changes to business (births and restructuring) is Revenue Canada’s payroll deduction (PD) 
system. The activation of a new PD account by an employer is treated as a signal that 
something has happened. Such signals are followed up with the business to identify whether 
a frame update is required. Other signals will be obtained from annual tax returns, from 
responses to regular surveys, and from routine profiling. 
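
The use of new PD accounts as signals could be sketched along the following lines. This is an assumption-laden illustration, not a description of Statistics Canada's system: the account identifiers, the idea of comparing two successive extracts, and the function name are all hypothetical.

```python
def new_account_signals(previous_accounts, current_accounts, accounts_on_frame):
    """Return one signal per PD account that is active now but was not in the
    previous extract, noting whether it is already linked to a frame unit."""
    newly_active = sorted(set(current_accounts) - set(previous_accounts))
    return [
        {"pd_account": acct, "linked_to_frame": acct in accounts_on_frame}
        for acct in newly_active
    ]


# Hypothetical extracts of active PD accounts
previous = {"PD001", "PD002", "PD003"}
current = {"PD001", "PD002", "PD003", "PD004", "PD005"}
on_frame = {"PD001", "PD002", "PD003", "PD005"}

signals = new_account_signals(previous, current, on_frame)
```

A signal only says that something has happened. Whether it represents a birth, a restructuring, or no change relevant to the frame still has to be settled by follow-up with the business.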

In the case of smaller companies, where the structure is usually simpler but the turnover 
is faster, no attempt is made to define the various types of unit and their inter-relationships. 
Instead, administrative data are used directly. Two alternative lists of businesses are made 
available as a basis for surveying — one is the most recent set of annual tax returns; the second 
is the current set of PD accounts. In both cases, all units above the threshold are removed. 
These two lists overlap and the most appropriate one is used in each particular survey. The 
PD-based list, which is more current since PD accounts may be opened or closed at any time 
during the year, is preferred for sub-annual surveys. It has the disadvantage of excluding 
non-employers. 


4.2 Substituting for Survey Data 


In the interests of minimizing both response burden and costs, tax data are used to replace 
survey data where feasible. The concepts and definitions underlying tax data do not uniformly 
coincide with the survey definitions required to assure consistency in the System of National 
Accounts or for other analytic purposes. Therefore care has to be taken in selecting from 
tax returns the data items that come closest to the required survey definitions. Furthermore, 
tax data do not contain the full range of variables required by many annual business surveys. 
In particular, they lack production statistics. 

A further problem in utilizing tax data lies in establishing the relationship between the 
unit for which a tax return is submitted and the unit(s) to be surveyed. This is a problem 
particularly for the large complex businesses referred to earlier. 

The strategy that has been developed for annual surveys is to make use of tax data primarily 
for small businesses where there is usually a one-to-one relationship between the taxfiler and 
the business. This approach significantly reduces the response burden on small businesses, 
without unduly affecting the quality of final data, since the bulk of economic activity is 
reported through the survey returns of larger companies. 

It is clear from this brief overview of the new business survey strategy and infrastructure 
that there is a fundamental dependence on tax data for the continuing functioning of the 
program. This requires a very close working relationship between Statistics Canada and 
Revenue Canada so that the impact of administrative and procedural changes in the tax system 
can be assessed and prepared for in advance. 


5. SOCIO-ECONOMIC DATA FROM ADMINISTRATIVE SOURCES 


A systematic effort to develop data on individuals, families and households from 
administrative records was initiated in the late 1970s. The original motivation for this work 
was the rising costs of census-taking and the search for cheaper alternatives. It quickly became 
apparent that the statistical potential of administrative records on individuals in Canada lay 
in supplementing the quinquennial census through the provision of data for small areas inter- 
censally, rather than in replacing the census. It is not possible to achieve the coverage, 
geographic specificity, and range of individual, family and household characteristics required 
from a census with the existing administrative record systems. Nevertheless, the emulation 
of census coverage using a combination of administrative record systems is being pursued, 
together with the study of the possibility of replacing some census questions with data derived 
from administrative sources. 

This section will concentrate on the use of administrative records to supplement census 
data inter-censally. The focus of the developmental work has been on administrative record 
systems that are national in scope (e.g., income tax, unemployment insurance (UI), family 
allowance, old age security) rather than systems that are administered at provincial or lower 
levels (e.g., health insurance, driver’s licences, municipal assessments). In the latter case the 
problem of standardization across jurisdictions is added to the other problems inherent in 
the statistical use of administrative records. 

The annual individual tax file (T1) has proven to be the principal source of statistical data 
on individuals. The first use of this file was its direct tabulation to produce statistics on income 
and labour force participants by age and sex for provincial and sub-provincial areas. Iden- 
tification of geographic location of taxfilers is based on the postal code indicated on the 
record. A file that provides a conversion from postal code to the various levels of census 
geography (province, county, municipality, electoral district, etc.) has been developed. Special 
tabulations can also be produced for user-defined areas described in terms of postal codes. 

Data derived in this way are, of course, based on the concepts, definitions and regula- 
tions implicit in the Income Tax Act. These may not conform to definitions desired for analytic 
purposes (e.g., some forms of social assistance which are not taxable may be excluded). Income 
can be broken down by source; in particular, employment income can be separated. Variables 
available for cross-classification are limited (e.g., age, sex and marital status). Occupation, 
though asked on the tax form, is not reported nor coded with sufficient quality to be 
statistically useful. The coverage of these data is limited by the need to file a tax return. 
Low income individuals and dependents are therefore under-represented. Over time, changes 
to tax law can have a significant impact on coverage; e.g., the introduction of the Child 
Tax Credit, that required low income earners to file a tax return in order to claim the credit, 
led to a marked increase in coverage in 1978 compared to the previous year. 

Despite these reservations, data produced by direct tabulation from income tax files pro- 
vide a useful inter-censal source of small area income data. A recent publication from Statistics 
Canada made use of this source to produce data for Forward Sortation Areas, i.e., the first 
three characters of the postal code (Statistics Canada 1987). Since a prime concern in the 
publication of data for small areas is to ensure that no individual data can be deduced from 
aggregate totals for small areas, data are not provided for areas with less than 100 taxfilers. 
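
The tabulation and suppression rule just described can be sketched in a few lines. This is an illustration under assumed names: the record layout (postal code, income), the function name tabulate_by_fsa, and the choice of count and mean income as the published figures are hypothetical. Only the definition of a Forward Sortation Area and the minimum of 100 taxfilers come from the text.

```python
from collections import defaultdict

MIN_TAXFILERS = 100  # areas with fewer taxfilers are not published (from the text)


def tabulate_by_fsa(records, min_count=MIN_TAXFILERS):
    """Aggregate taxfiler income by Forward Sortation Area (FSA).

    `records` is an iterable of (postal_code, income) pairs; this layout is
    a hypothetical stand-in for the actual tax file.
    """
    counts = defaultdict(int)
    totals = defaultdict(float)
    for postal_code, income in records:
        # The FSA is the first three characters of the postal code.
        fsa = postal_code.replace(" ", "").upper()[:3]
        counts[fsa] += 1
        totals[fsa] += income

    table = {}
    for fsa, n in counts.items():
        if n < min_count:
            continue  # suppress small areas to protect individual data
        table[fsa] = {"taxfilers": n, "mean_income": totals[fsa] / n}
    return table
```

An area with 99 taxfilers is dropped from the output entirely, while one with exactly 100 is published.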

A second use of the individual tax file is for estimating annual migration. This is achieved 
by matching individuals on tax files for two successive years and comparing the Census Divi- 
sion (or county) code assignment for each year. If there has been a change in code, it is assumed 
that the taxfiler has migrated. Demographic and tax exemption information are used to 
estimate the total number of persons who have migrated with the taxfiler. In a final stage, 
since the tax file does not cover the whole population, an adjustment is made to estimate 
the total number of migrants from year to year. Since 1981, tax-based migration estimates 
have been used in Statistics Canada’s population estimates program. A full description of 
the methodology for estimating migration from tax records can be found in Norris and 
Standish (1983). 
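
The three stages described above (matching, counting accompanying persons, and adjusting for coverage) can be sketched as follows. This is a deliberately simplified illustration, not the methodology of Norris and Standish (1983): the data layout, the function name estimate_migrants, the use of a dependents count taken directly from the second-year record, and the single coverage rate applied as a divisor are all assumptions.

```python
def estimate_migrants(year1, year2, coverage_rate):
    """Estimate total migrants from two successive tax files.

    `year1` and `year2` map a taxfiler identifier to a pair
    (census_division_code, n_dependents). `coverage_rate` is the assumed
    share of the population represented on the tax file (0 < rate <= 1).
    All of these inputs are hypothetical simplifications.
    """
    if not 0 < coverage_rate <= 1:
        raise ValueError("coverage_rate must be in (0, 1]")

    covered_migrants = 0
    for filer_id, (cd_prev, _) in year1.items():
        if filer_id not in year2:
            continue  # only taxfilers matched in both years are compared
        cd_now, dependents = year2[filer_id]
        if cd_now != cd_prev:
            # A change of code is taken to mean the taxfiler has migrated,
            # together with the persons assumed to accompany the taxfiler.
            covered_migrants += 1 + dependents

    # Final stage: inflate to allow for persons not covered by the tax file.
    return covered_migrants / coverage_rate
```

For example, one matched taxfiler who changes Census Division with two dependents, under an assumed coverage rate of 0.8, contributes 3 / 0.8 = 3.75 estimated migrants.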

While data on individual incomes can be derived from tax data as described earlier, more 
analytic and policy interest focuses on family income. To derive family income from the 
individual tax file requires the capacity to identify and match records of individuals belong- 
ing to the same family. Development of family income data in this way has been proceeding 
with encouraging results. A description of methodology and results can be found in Auger 
(1987). 

A second important administrative source of data on individuals is the unemployment 
insurance (UI) system. Files of both claimants and beneficiaries are available to Statistics 
Canada. The UI claimant and beneficiary files contain individuals who, for a variety of 
reasons, may be entitled to UI benefits. Not all of these individuals are considered to be 
unemployed according to the standard international definition of unemployment as incor- 
porated in the Labour Force Survey (LFS), the source of published unemployment rates. 
If a closely corresponding category in the UI system can be found, these files can be used 
to tabulate counts of ‘‘unemployed’’ for small areas. However, since even the best choice 
of category in the UI system does not correspond exactly with the definition of ‘‘unemployed’’ 
used in the LFS, attention has to be focused on how to integrate or reconcile these two sources 
of data. For example, monthly counts for small areas from the UI system might be used 
as indicators of changes in unemployment at the local level which could be calibrated to reliable 
LFS estimates at a higher geographic level (e.g., the province). Various methods of estima- 
tion along these lines have been investigated (e.g., regression estimation, SPREE - structure 
preserving ratio estimation), though without as yet any final conclusion as to the most 
appropriate method. A description of this work can be found in Trottier and Choudhry (1985) 
while Feeney (1987) describes a similar approach in the Australian context. A time series 
modelling approach which exploits the correlated structure of the error over time appears 
very promising (Choudhry and Hidiroglou 1987). 
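
The idea of calibrating local UI counts to a reliable LFS estimate at a higher level can be illustrated with the simplest possible adjustment, a single ratio. This sketch is not one of the methods cited above (regression estimation, SPREE, or time series modelling); it is a plain ratio benchmark, and the function name calibrate_to_lfs and the input layout are assumptions.

```python
def calibrate_to_lfs(ui_counts, lfs_total):
    """Scale small-area UI counts so that they sum to an LFS total.

    `ui_counts` maps a small-area label to its UI count within one province;
    `lfs_total` is the LFS estimate of unemployed for that province.
    Both inputs are hypothetical illustrations.
    """
    ui_total = sum(ui_counts.values())
    if ui_total <= 0:
        raise ValueError("UI counts must sum to a positive number")
    ratio = lfs_total / ui_total
    # Each area keeps its share of the UI total; only the level changes.
    return {area: count * ratio for area, count in ui_counts.items()}


# Example with made-up figures: two areas, provincial LFS estimate of 150.
calibrated = calibrate_to_lfs({"A": 30, "B": 70}, 150.0)
```

The adjustment preserves the local pattern shown by the UI data while taking the overall level from the LFS, which is the division of roles suggested in the text.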

These examples have illustrated that, in the case of statistics on individuals, the primary 
uses of administrative records are for direct tabulation and as input into estimation processes. 
This contrasts with the examples from the business side where frame maintenance and substitu- 
tion for survey responses were the main uses. 

While these two examples represent two important developing areas of administrative record 
use in Statistics Canada, they cover only a small fraction of the administrative files used 
by the Agency. There is, for example, a widespread and long-standing use of administrative 
records in the social institutions area (education, health, justice) both for creating survey 
frames and for obtaining statistical data. Current developmental work on telephone survey- 
ing and on address registers is using administrative records to develop frames of dwellings 
or households. A recent internal survey identified more than 50 administrative systems being 
used for statistical purposes. These covered the full range of types and uses described in Sec- 
tions 2 and 3, and included examples from areas as varied as disease registries, motor vehicle 
licences, aircraft landings, milk marketing boards, fuel sales tax, municipal construction 
records, and customs and excise. 


6. ACCESSING AND INFLUENCING ADMINISTRATIVE SYSTEMS 


It is clear from this review of the use of administrative records for statistical purposes 
that administrative records are a vital input to many of Statistics Canada’s programs. This 
leads to a consideration of measures the Agency can take to protect the supply of data from 
administrative sources, and perhaps to make them more useful for statistical purposes. In 
this section we will deal with the two primary issues of obtaining access to administrative 
records, and influencing their content, design or associated procedures. 


6.1 Access 


The legal authority for access to administrative records is provided by Section 12 of the 
Statistics Act (1971): 


‘‘A person having the custody or charge of any documents or records that are 
maintained in any department or in any municipal office, corporation, business or 
organization, from which information sought in respect of the objects of this Act can be obtained 
or that would aid in the completion or correction thereof, shall grant access thereto 
for those purposes to a person authorized by the Chief Statistician to obtain such 
information or such aid in the completion or correction of such information.’’ 


While this provision appears to give fairly broad access rights, it is not without 
limitations. In some cases, legislation governing the administrative process places restrictions on 
access or secondary use of the administrative data. This leads to a confrontation of legisla- 
tion that will at best delay the negotiation of access. In some cases, access for statistical pur- 
poses is specifically permitted. 

Enabling legislation is a necessary but not sufficient condition for the productive utiliza- 
tion of administrative records. A co-operative approach to the development and utilization 
of administrative records for statistical purposes is likely to be far more effective in obtain- 
ing access to administrative records than an approach involving legal arguments and sanc- 
tions. Indeed, once access is obtained, the subsequent step of influencing design or procedures 
is only achievable if there is a spirit of co-operation between the administrative and statistical 
agencies. 

Access to administrative records by Statistics Canada is strictly a one-way street. Individual 
micro-data are provided from the administrative agency to the statistical agency, but only 
confidentiality-protected aggregate data can flow back. The only exception to this rule is 
the case where the administrative agency depends on the statistical agency to organize, for- 
mat, edit, process, or restructure its records, and a version of the original micro-data is passed 
back to the supplying agency. 


6.2 Influencing Change 


We have already alluded to the potential impact of changes in administrative regulations 
or practices on resulting statistics. Discontinuities in time series based on administrative records 
can be caused by simple changes in the coverage of a program, the introduction of an incen- 
tive to join or leave a program, or procedural changes that affect quality or completeness 
of records. Thus the statistical agency has to guard against, and react to, externally imposed 
changes. 

There are other kinds of changes that the statistical agency might like to see implemented. 
A frequent frustration of the statistician trying to use administrative records is the feeling 
that the administrative records could be so much more useful if only relatively minor changes 
were made. For example, the addition of an extra question, the use of a different concept, 
the coverage of an additional subgroup, or the introduction of a quality check might 
significantly enhance the statistical value of the records. On the other hand, why should the 
administrative agency contemplate changes not required for the primary administrative pro- 
cess, changes which would probably in some measure add to the cost and complexity of the 
administrative process? 

The challenge for a statistical agency is to persuade the administrators that the benefits 
from such a change outweigh any additional administrative costs. This is made harder to 
the extent that the benefits do not accrue to the department responsible for the administrative 
system, but to separate policy-making departments and other statistical users. 

It is usually easier to build statistical requirements into a system from its inception than 
to make changes to a system that is already operational. Therefore, a mechanism that would 
allow statistical requirements to be considered during the design, or the major redesign, of 
an administrative system is preferable to one that only tries to adjust existing systems. A 
topical case in Canada is in the area of tax reform, currently under consideration by the 
government. This could significantly change the collection of business data in Canada. Involve- 
ment of statisticians in the design of such a system could greatly enhance the statistical benefits 
derived from the system. Of course, the institution of a new administrative system is a relatively 
rare occurrence, so that adjustment to existing systems is also necessary if statistical benefits 
are to be obtained in the short run. On the other hand, the comparative rarity of design 
or redesign of major administrative systems strengthens the argument for not missing oppor- 
tunities to influence such exercises when they do arise. 


6.3 Mechanisms 


A variety of measures or mechanisms, some bilateral involving the statistical agency and 
a specific administrative department, others of a broad government-wide nature, can assist 
the statistical agency in accessing and influencing administrative systems. These include: 


(i) bilateral committees at a senior level to review and discuss issues of mutual interest, 
including problems related to the supply of administrative data; 


(ii) feedback of statistical data to the administrative agency to demonstrate both 
usefulness of the data and, perhaps, weaknesses arising from administrative practices; 


(iii) provision of technical advice or services in support of the administrative agency’s 
own statistical activities; 


(iv) a government information collection policy that requires, for example, any data col- 
lection activity plan (statistical or administrative) to be reviewed by a central agency; 


(v) statistical planning in the form of a requirement that each new program proposal 
include a plan for acquiring the statistical information needed to monitor and evaluate 
the program; 


(vi) promotion of the use of standard statistical definitions (e.g., family, business 
establishment, unemployed) in administrative systems; 


(vii) audits that identify the use of administrative records as a cost-efficient alternative 
to other means of acquiring information; 


(viii) political instruction to make greater use of particular administrative systems or seek 
alternatives to survey-taking; 


(ix) removal of legislative impediments to access or use of administrative records for 
statistical purposes. 


Statistics Canada’s experience in dealing with other federal government departments has 
been most successful in cases where close bilateral arrangements have been developed. The 
introduction of senior bilateral committees in the early 1980s was supportive of such 
arrangements, and in some cases instrumental in creating them. Government-wide measures 
such as information management and statistical planning have been less successful in 
facilitating administrative record use. Government audits and cabinet directives have pro- 
vided impetus to activities aimed at increasing administrative data use, but the increased use 
itself is again dependent upon close working relationships with particular departments. While 
it is convenient to characterize the statistical agency as the progressive agency trying to break 
down unreasonable barriers to administrative data use, it must also be recognized that there 
may be inertia to the associated changes within the statistical agency itself. Staff whose careers 
have been based on survey design and survey-taking may need convincing that budgetary 
restrictions and data needs now necessitate combining these with other approaches. 

Since the above comments have focused on federally administered systems, we will add 
a few words about provincial records. While some of the above measures apply equally to 
provincially administered records, the fundamental problem in dealing with subnational 
jurisdictions is that of adherence to common standards. Differing provincial needs and 
priorities, facilitated by increasing technological capacity, will lead to divergent administrative 
systems in the absence of any centralizing force. Statistics Canada has used a variety of 
mechanisms in the past in attempts to encourage conformity, but with only mixed success. 
As with federal government custodians of administrative records, mutual benefit has to be 
the major incentive to conformity. Federal-provincial committees exist in several subject areas. 
The Vital Statistics Council, consisting of provincial registrars of vital events and represen- 
tatives of Statistics Canada, is a successful and long-standing example. Such committees have 
developed and monitored conventions for reporting certain data items in the past. For exam- 
ple, the framework for municipal finance reporting was developed as a result of federal- 
provincial meetings on municipal financial statistics. 


7. CONFIDENTIALITY, PRIVACY AND PUBLIC RELATIONS ISSUES 


Even with the legal authority to exploit administrative records and co-operative 
administrative agencies to supply them, careful consideration has to be given to the public 
perception of the use of administrative records beyond their original purpose. Since the effec- 
tiveness, if not the survival, of a statistical agency depends critically upon the continuing 
co-operation and trust of respondents, it must take extreme care before embarking on any 
activity with the potential to undermine that co-operation or trust. 

Public awareness and concern over privacy and related issues of information access and 
control have risen in many countries in recent years. In Canada, passage of the Privacy Act 
in 1982 bore witness to this mounting concern. The Privacy Act requires, inter alia and with 
some exceptions, that an index of all personal information banks under the control of federal 
government institutions be published periodically, that individuals have the right of access 
to information about themselves contained in such information banks, and that personal 
information be used only for purposes consistent with the purpose for which it was obtained. 
One of the exceptions to this last provision is that personal information may be disclosed 


‘‘... to any person or body for research or statistical purposes if ... the purpose 
for which the information is disclosed cannot reasonably be accomplished unless the 
information is provided in a form that would identify the individual to whom it relates, 
and ... a written undertaking (is obtained) that no subsequent disclosure of the infor- 
mation will be made in a form that could reasonably be expected to identify the 
individual to whom it relates.’’ (Privacy Act 1982 Section 8(2)(j)). 


This provision covers the use of administrative records for statistical purposes as far as the 
Privacy Act is concerned. However, this Section is subject to any other Act of Parliament 
so that a clause forbidding such use in an Act governing an administrative process would 
have precedence. 

While the Privacy Act and other Acts recognize statistical work as a legitimate secondary 
use of administrative records under certain conditions, this alone will not allay public con- 
cern over the existence of data banks that could be used to an individual’s detriment. It is 
doubtful whether the average citizen appreciates the distinction between statistical use, where 
the identity of the individual record is of no lasting interest, and administrative use, where 
the essence of the individual record is the particular unit to which it relates. It would be easier 
to explain and utilize this distinction if we could state unequivocally that identifiers are never 
needed for statistical purposes. Unfortunately this is not the case. Several legitimate statistical 
techniques do require identifiers in intermediate data manipulations. These techniques all 
involve some form of matching data from different files or different occasions, and iden- 
tification is required to ensure that the correct records are matched. Once the matching has 
been accomplished the records can be anonymized provided no subsequent linkage is planned. 
Examples include the requirement for names in a population census to ensure coverage and 
permit coverage measurement, longitudinal studies using administrative records, 
epidemiological investigations, and evaluation studies to check survey responses against 
administrative sources. Explaining why identifiers are needed when identity is of no interest 
is an interesting challenge facing the statistical agency. 
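
The sequence described above, in which identifiers are used only to bring the correct records together and are then removed, can be sketched as follows. This is a minimal illustration: the function name link_and_anonymize, the field name person_id, and the exact-match rule are assumptions, and real linkage work would involve far more careful matching and control.

```python
def link_and_anonymize(survey, admin, id_field="person_id"):
    """Match survey records to administrative records, then drop the identifier.

    `survey` and `admin` are lists of dicts that share an identifier field.
    The field name and the exact-match rule are hypothetical.
    """
    admin_by_id = {rec[id_field]: rec for rec in admin}

    linked = []
    for rec in survey:
        match = admin_by_id.get(rec[id_field])
        if match is None:
            continue  # unmatched records are left out of the linked file
        merged = {**match, **rec}  # survey values win on any name clash
        del merged[id_field]       # identity is of no further interest
        linked.append(merged)
    return linked
```

Removing the identifier at this point is appropriate only when, as the text notes, no subsequent linkage is planned.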

A further source of concern may relate to the undertaking of confidentiality itself. Despite 
Statistics Canada’s record of confidentiality protection there are doubtless respondents who 
are skeptical about the protection their information enjoys. This concern may be heightened 
by the use of enumerators who are known to respondents, particularly in small communities. 
Some respondents seem to assume there is a high degree of information exchange actually 
taking place between federal departments, and in some cases do not distinguish between dif- 
ferent departments of government. 

An additional concern may relate, not to the trustworthiness of the present custodians 
of information banks, but to a fear that personal information cannot be protected against 
future violation, either illegally, or by a legitimate elected authority with different views on 
privacy. Protection against this possibility would require the removal of all identifying infor- 
mation from statistical data bases. 

This public concern over privacy and the manipulation of personal information requires 
the statistical agency to consider measures it can take to prevent or minimize negative public 
reaction to its legitimate use of administrative records for statistical purposes. Since this is 
essentially an issue of public perception, it is important that the statistical agency be open 
about its practices, and that any of the following measures that are implemented are clearly 
visible to the interested public. 


(a) Public communications to respondents and users should continually stress the impor- 
tance attached to confidentiality of all individual (micro) data acquired by the statistical 
agency. 


(b) The one-way nature of micro-data flow should be stressed. Micro-data flow into the 
statistical agency, but only confidentiality-protected aggregates or summaries flow out. 
This applies equally to survey or census data and data from administrative records. 


(c) The benefits of administrative record use in terms of reduced respondent burden and 
savings to the taxpayer should be emphasized. Such claims should be supportable by 
real measures of cost and respondent burden savings. 


(d) An explicit and public policy on record linkage stipulating the conditions under which 
the statistical agency will undertake such activities can be helpful both in demonstrating 
careful consideration and control of linkage activities, and in forestalling linkage 
requests that would violate the conditions. 


(e) The Privacy Act requires that individuals be informed of the purpose for which any 
personal information is being collected. Administrative agencies should be encouraged 
to ensure that statistical purposes are included in such statements. Even though 
statistical purposes may be a permissible secondary use of administrative records, their 
explicit mention on the collection form will serve to avoid subsequent surprise. 


(f) The physical security that surrounds the use of sensitive administrative records should 
be clearly visible, and perhaps even tighter than that in use generally within the Agency. 
For example, in Statistics Canada, the divisions having primary custody of tax data 
are housed in limited access areas within buildings that are themselves subject to security 
checks on entry. 


(g) Exemption of statistical files from examination by security or intelligence services 
is an important element in maintaining public trust in the absolute confidentiality of 
data provided to the statistical agency. An exemption for Statistics Canada data (the 
sole institutional exception within government) was provided when the new Canadian 
Security and Intelligence Service was formed in 1983. 


While the above points represent some specific measures that can be taken to avoid or 
respond to public reaction to the use of administrative records, ultimately the statistical agency 
must have strong political support for this kind of activity. The political credit to be gained 
from demonstrated reductions in costs and respondent burden, coupled with strong political 
assurances of the protection of individual data, provide a strong platform for politicians 
to dispel public concern over the use of administrative records for statistical purposes. At 
the same time they must immediately and unambiguously confront and correct any sugges- 
tion that statistical records be used for administrative purposes. 


8. CONCLUSION 


Administrative records are, and will continue to be, an increasingly important source of 
statistical data. The relative strengths and weaknesses of data derived from administrative 
systems, in terms of cost, coverage, quality, relevance and timeliness, in comparison to census- 
or survey-based data, dictate the manner in which these sources of data are most effectively 
used. Current uses of administrative records include direct tabulation, indirect estimation, 
substitution for survey responses, frame construction and maintenance, and data evalua- 
tion. These uses now permeate most statistical programs and can be expected to extend even 
further in the future. 

In Canada, administrative records have become part of the fabric of our statistical system. 
Their use has been one of the means by which Statistics Canada has been able to maintain 
its programs in the face of declining budgets. In the process, respondent burden has been 
reduced and new, or more frequent, data series have become available. Since we do not have 
administrative registers as such, considerable attention has been paid to issues of coverage 
and the joint use of both administrative and survey-based data to ensure valid estimation 
of universe totals. The use of record linkage techniques, though requiring careful controls, 
has proven to be very valuable, particularly for business data, longitudinal labour market 
studies, and epidemiological work. 

With the growing use of administrative records, statistical agencies are becoming increas- 
ingly dependent upon other agencies for the uninterrupted flow of input data to their statistical 
programs. Whatever the legislative and policy environment in which the statistical agency 
operates, the establishment of close co-operative arrangements with supplying agencies is 
crucial. The ability of the statistical agency to influence the design or redesign of administrative 
systems rests on a mutual understanding of the requirements of the two agencies. Establish- 
ment of a government-wide policy or principle that the statistical agency should have a voice 
in decisions regarding the design of administrative systems, or more generally, in proposals 
for meeting the statistical needs of new programs, can help the statistical agency in this regard, 
but is no substitute for the fostering of close co-operation with administrative agencies. 

A variety of mechanisms can be considered to assist the statistical agency in gaining the 
access and influence it requires within the government system. The applicability and effec- 
tiveness of each mechanism will depend upon the underlying legislative and political climate, 
and on the mandate and status of the statistical agency within the government apparatus. 
Statistics Canada’s experience has been that close bilateral working relationships with 
administrative departments, based on a principle of mutual benefit, are the most effective 
approach. Political support for the use of administrative records is important and has been 
forthcoming through recent government decisions related to budget reductions. 


Statistical Properties of Crop Production Estimators 


CAROL A. FRANCISCO, WAYNE A. FULLER, and RON FECSO¹ 


ABSTRACT 


The National Agricultural Statistics Service, U.S. Department of Agriculture, conducts yield surveys 
for a variety of field crops in the United States. While field sampling procedures for various crops 
differ, the same basic survey design is used for all crops. The survey design and current estimators 
are reviewed. Alternative estimators of yield and production and of the variance of the estimators are 
presented. Current estimators and alternative estimators are compared, both theoretically and in a Monte 
Carlo simulation. 


KEY WORDS: Crop surveys; Yield estimation; Two phase sample; Variance estimation. 


1. INTRODUCTION 


The National Agricultural Statistics Service (formerly known as the Statistical Reporting 
Service), U.S. Department of Agriculture, conducts objective yield surveys of corn, cotton, 
soybeans, rice, grain sorghum, sunflowers and wheat in states which are major producers 
of these field crops. Similar yield surveys are conducted in a number of other countries. 

While field sampling procedures for each crop differ in terms of plot sizes, plot location 
methods, and vegetative and fruit measurement techniques, all surveys rely on the same basic 
design. A four-step sampling procedure is used. A description of this survey design is con- 
tained in Section 2. Section 3 describes the estimators of average crop yield and the variance 
estimators, evaluates them and explores alternative estimators. Conclusions and recommen- 
dations are presented in Section 4. 


2. OBJECTIVE YIELD SURVEY DESIGN 


The first two steps of sample selection produce the sample of area segments used in the 
June Enumerative Survey conducted by the National Agricultural Statistics Service (NASS). 
The area frame for each state is stratified by land use. For example, the State of California 
is divided into 12 land use strata. Each land use stratum is subdivided into areas called frame 
units. The size of a frame unit varies; the actual size of any given frame unit depends upon 
available boundary designations, available ancillary information, political boundaries, and 
so forth. Once frame units are established, the number of area segments in each frame unit 
is determined by dividing the total area of each frame unit by the target segment size. The 
target size is a function of the land use stratum into which the frame unit falls. For example, 
in California the target segment size is one half square mile in the orchard stratum and one 
square mile in all other cropland strata. Frame units typically contain between one and 30 
area segments. 


¹ Carol A. Francisco, Syntex Laboratories Inc., 3401 Hillview Avenue, Palo Alto, California 94304; Wayne Fuller, 
Department of Statistics, Iowa State University, Ames, Iowa 50011; and Ron Fecso, Survey Research Branch, 
National Agricultural Statistics Service, U.S. Department of Agriculture, Washington D.C. 20250. 
Each land use stratum is substratified on the basis of geography. To develop the geographic 
substrata, frame units within each land use stratum are ordered by county in such a manner 
that adjacent counties that are agriculturally similar are placed together (Fecso 1978). Substrata 
are formed from sequential groups of area segments. Thus, substrata contain area segments 
that are agriculturally similar and geographically close together. Within a given land use 
stratum, substrata have an equal number of segments and equal area (within rounding). 
Detailed information on the area frame design is available in Fecso and Johnson (1981) and 
Houseman (1975). 

For purposes of variance estimation, it is the substrata within land use strata that are the 
sampling strata. Henceforth, the land use substrata will be referred to simply as strata. 

The first step in sampling from the area frame is the selection of frame units within each 
stratum. The number of frame units allocated to a stratum depends on the agricultural nature 
of the stratum. Typically, eight to 15 frame units are drawn in cropland strata; whereas in 
agri-urban, city, and nonagricultural strata four to five frame units are drawn. Frame units 
within strata are selected at random with probability proportional to the number of area 
segments in the frame unit. At the second step, one area segment is chosen at random from 
each selected frame unit. Thus, each area segment within a stratum has an equal probability 
of selection. 
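
The equal-probability property of the first two steps can be checked with a short calculation. The sketch below is illustrative only: the frame-unit sizes and the function name are hypothetical, and it works with the per-draw selection probability rather than reproducing any NASS program.

```python
from fractions import Fraction

def segment_draw_probabilities(segments_per_frame_unit):
    """Per-draw selection probability of every area segment in one stratum.

    Step 1 picks a frame unit with probability proportional to its number
    of segments; step 2 picks one of its segments at random.
    """
    total = sum(segments_per_frame_unit)
    probs = []
    for size in segments_per_frame_unit:
        p_frame_unit = Fraction(size, total)   # step 1: PPS draw of the frame unit
        p_within = Fraction(1, size)           # step 2: one segment at random
        probs.extend([p_frame_unit * p_within] * size)
    return probs

# Hypothetical stratum with four frame units of 1, 5, 12 and 30 segments.
probs = segment_draw_probabilities([1, 5, 12, 30])
print(set(probs))  # a single distinct value: every segment has probability 1/48
```

The frame-unit size cancels between the two steps, which is why the segment can be treated as the sampling unit.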

Although the frame unit is the primary sampling unit for this design, because the frame 
units are selected with probability proportional to the number of segments and one segment 
is selected per sampled frame unit, the segment can be treated as the primary sampling unit. 
In our study, steps one and two in the sampling procedure are considered as one procedure, 
and the sample of segments will be treated as a stratified single stage simple random sample. 
Since the average sampling rate is about one percent, the finite population correction term 
will be ignored in our analysis. 

The third and fourth steps in the sampling procedure involve the selection of fields and 
of plots within selected fields. As part of the June Enumerative Survey, all selected area 
segments are screened for fields which have been planted or are scheduled to be planted with 
the crop of interest. These fields are listed by segment number and order of enumeration 
within segment. A systematic sample of fields is selected with selection probabilities propor- 
tional to the product of the field area and the inverse of the probability of selection of the 
area segment in which the field is contained. Hence, the number of sampled fields per seg- 
ment varies, and large fields within a segment can be selected more than once. 
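
A minimal sketch of systematic selection with probability proportional to size shows why a large field can be hit more than once. The size measures, the function name and the random start are hypothetical; the size measure stands in for the product of field area and the inverse segment selection probability described above.

```python
from bisect import bisect_left
from itertools import accumulate

def systematic_pps(sizes, n_hits, start):
    """Count systematic PPS hits per listed unit.

    sizes  -- size measure of each listed field, in list order
    n_hits -- number of selections to make
    start  -- random start in (0, interval], interval = sum(sizes) / n_hits
    """
    cumulative = list(accumulate(sizes))
    interval = cumulative[-1] / n_hits
    if not 0 < start <= interval:
        raise ValueError("start must lie in (0, interval]")
    hits = [0] * len(sizes)
    for j in range(n_hits):
        point = start + j * interval
        # First unit whose cumulated size reaches the selection point;
        # min() guards against floating-point overshoot at the very end.
        index = min(bisect_left(cumulative, point), len(sizes) - 1)
        hits[index] += 1
    return hits

# Four hypothetical fields; the interval is 100 / 4 = 25.
print(systematic_pps([10, 50, 5, 35], n_hits=4, start=12.5))  # [0, 2, 1, 1]
```

The second field is larger than the sampling interval, so it must receive at least one hit and here receives two.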

At the fourth and final step, two plots of roughly equal area are placed in each selected 
field using a random row and pace method of location. Where rows are not readily 
distinguishable, and in the case of wheat, a random number of paces along the field edge 
and a random number of paces into the field are used to locate plots. A further exception 
occurs in the wheat objective yield survey. For this survey the first plot is randomly located 
and the second plot is placed in a fixed position relative to the first plot. In the event that 
a large field is selected more than once during the third step of the sampling procedure, addi- 
tional sets of two plots are independently sampled. Because plots are always sampled in pairs, 
we call the pair of plots the secondary unit. A maximum of eight plots (that is, four secon- 
dary units) per field is imposed. 


3. ESTIMATION PROCEDURES 


Formally, the sample is a two phase sample with subsampling in the second phase. Table 1 
contains a schematic description of the sample. The phase one sample is a stratified simple 
Table 1 
Sampling Procedure for the Objective Yield Survey 

Phase / Sampling Unit         Selection Procedure    Sampled Number¹   Data Collected
--------------------------------------------------------------------------------------------
Phase One
  Primary Sampling Unit:      equal probability      $n_h$             crop acres
  Segment                     within strata
Phase Two
  Primary Sampling Unit:      unequal probability    $K_h$             crop acres, estimated
  Segment                                                              production²
  Secondary Sampling Unit:    equal probability      $m_{hk}$          estimated production
  Pair of Plots                                                        from plots

¹ Number is per stratum for primary sampling units and is per segment for secondary sampling units. 
² Segment production is zero if the crop acreage is zero and is estimated from plot determinations if the crop acreage 
is positive. 


random sample of segments. The phase two sample is composed of all segments with zero 
crop acres and a probability-proportional-to-crop-acres sample of segments with the crop. 
The sample of segments is the result of a probability-proportional-to-area systematic sample 
of first phase fields planted with the crop. A sample of secondary units, where each secon- 
dary unit is a pair of plots, is selected from the segments in the phase two sample that have 
the crop. Because the secondary unit is always a pair of plots, we will henceforth refer to 
secondary units and no longer speak of plots. We will also ignore the fact that the opera- 
tional units used to locate the plots are fields and speak only of the sampled segments. 

Notice that two types of segments are observed at phase two - those that have zero acres 
of the crop and those that have non-zero acres. The total number of second phase segments 
is K. The acres and the total production are known (both equal to zero) for an observed 
segment with zero acres. For second phase segments with positive acres, a subsample of secon- 
dary units is used to estimate production. 

Let $M_{hk}$ be the number of secondary units in segment k of the h-th stratum. Without loss 
of generality, $M_{hk}$ could be assumed to be equal to $A_{hk}$, where $A_{hk}$ is the crop area in 
segment hk. Equality requires only the choice of an appropriate scale for area. 

Section 3.1 examines the yield estimator that is currently used. Conditions under which 
this estimator is unbiased for state average yield are investigated. A simple estimator of the 
variance of estimated yield is discussed in Section 3.2. Estimators of the unconditional 
variances of the yield and production estimators are developed in Section 3.3. A Monte Carlo 
study of estimators is given in Section 3.4. 


3.1 Currently Used Yield and Production Estimators 


Estimates of the state average yield are currently computed as though the sample were 
an equal probability simple random sample of secondary units. The estimator is the simple 
average yield of secondary units with positive acreages. That is, the estimated average yield 
per acre is 

$$\bar{y} = D^{-1} \sum_{h=1}^{L} \sum_{k=1}^{n_h} \sum_{\ell=1}^{m_{hk}} y_{hk\ell}, \qquad (3.1)$$

where $m_{hk}$ is the number of sampled secondary units selected in segment hk, L is the number of 
strata, and $y_{hk\ell}$ is the estimated yield per acre for secondary unit $\ell$ of segment hk. If the 
crop acreage in a segment, $A_{hk}$, is zero, then $m_{hk} = 1$ and $y_{hk\cdot} = 0$, by definition. The total 
number of observed secondary units for segments with positive acres is D. 

Expression (3.1) can be written in the convenient operational form 


$$\bar{y} = D^{-1} \sum_{t=1}^{D} y_t, \qquad (3.3)$$


where the subscript t replaces the triple subscript $hk\ell$ and the summation is over secondary 
units in segments with positive crop acres. 

The estimator of average crop yield per acre (3.1) is a type of combined ratio estimator. 
This can be shown by using conditional selection probabilities to rewrite $\bar{y}$. In the NASS 
scheme, segments are selected systematically with probabilities proportional to expanded size, 
and segments with sufficiently large expanded acreage are included with certainty. The number 
of secondary units allocated to certainty segments is proportional to the size of the segment, 
up to rounding error. The rounding is performed by the systematic selection scheme. Let 
$\pi_{hk\ell}$ be the conditional probability that secondary unit $\ell$ in segment k of stratum h is selected, 
given the sample of segments selected at the first phase of the sampling procedure. We have 

$$\pi_{hk\ell} = D \left( \sum_{h=1}^{L} \sum_{k=1}^{n_h} N_h n_h^{-1} M_{hk} \right)^{-1} N_h n_h^{-1} \qquad (3.4)$$

for secondary units in segments with $A_{hk} > 0$, where $N_h$ is the population number of 
segments in stratum h, $M_{hk}$ is the number of secondary units in segment k of stratum h, 
and $n_h$ is the number of segments in stratum h selected at the first phase. The conditional 
probability of observing a segment with zero acres at the second phase is one. 
Then the mean estimator given in (3.1) can be written as 


VE Kp Mhk 
(3.5) 


where $N_h n_h^{-1}$ is the inverse of the first stage selection probability, $K_h$ is the number of second 
phase segments drawn from stratum h, and $K = \sum_h K_h$. Given an appropriate scale, the 
numerator of (3.5) is an estimator of the total production and the denominator is an estimator 
of the total area. It can be shown that the numerator is an unbiased estimator by taking 
expectations, conditioning on the first phase sample units and then averaging over first phase 
samples. The denominator is a stratified estimator of the total number of secondary units. 
By the nature of the sampling, the number of sampling units is proportional to acreage and 
one can choose the scale so that the number of secondary units is equal to acreage. Hence, 
$\bar{y}$ can be viewed as the ratio of an unbiased estimator of the total production of the crop 
to an unbiased estimator of the total area under the crop. 

To estimate total state production, NASS multiplies $\bar{y}$ by $\hat{A}$, where $\hat{A}$ is the estimator of 
total crop acreage defined by 

$$\hat{A} = \sum_{h=1}^{L} N_h n_h^{-1} \sum_{k=1}^{n_h} A_{hk}. \qquad (3.6)$$

Thus, the estimated total production is 

$$\hat{Y} = \bar{y} \hat{A}. \qquad (3.7)$$
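
As a small numerical illustration of the expansion estimator of acreage and the resulting production estimate, the sketch below uses a hypothetical two-stratum sample and an assumed average yield; none of the figures come from the survey.

```python
def expanded_acreage(strata):
    """Stratified expansion estimator of total crop acreage.

    strata -- list of (N_h, acres), where acres holds the crop acreage of
              each of the n_h first-phase segments sampled in the stratum
    """
    return sum(N_h / len(acres) * sum(acres) for N_h, acres in strata)

# Hypothetical sample: (population segments, sampled segment acreages).
strata = [(100, [10.0, 30.0]), (50, [0.0, 0.0, 20.0, 0.0, 5.0])]
A_hat = expanded_acreage(strata)   # 50 * 40 + 10 * 25
y_bar = 2.0                        # assumed average yield per acre
Y_hat = y_bar * A_hat              # estimated total production
print(A_hat, Y_hat)                # 2250.0 4500.0
```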


3.2 Simple Variance Estimators 


Under the assumption of simple random sampling of secondary units from the entire set 
of secondary units available at the second phase, the estimated variance of $\bar{y}$ conditional 
on the second phase segments is 

$$\hat{V}_2(\bar{y}) = D^{-1} (D - 1)^{-1} \sum_{t=1}^{D} (y_t - \bar{y})^2, \qquad (3.8)$$

where the subscript 2 on V is used to denote conditional variance and the subscript t on y 
replaces the triple subscript $hk\ell$. The sum over t is the sum over the D secondary units in 
segments with positive acres. 
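
The simple mean and its simple variance estimator are easy to compute directly. The sketch below assumes the usual with-replacement form, the sum of squared deviations divided by D(D - 1); the input values and the function name are hypothetical.

```python
def simple_mean_and_variance(yields):
    """Mean of the D observed secondary-unit yields and the simple
    variance estimator sum((y_t - y_bar)^2) / (D * (D - 1))."""
    D = len(yields)
    if D < 2:
        raise ValueError("at least two secondary units are needed")
    y_bar = sum(yields) / D
    ss = sum((y - y_bar) ** 2 for y in yields)
    return y_bar, ss / (D * (D - 1))

y_bar, v2 = simple_mean_and_variance([2.0, 4.0, 6.0])
print(y_bar, v2)
```

For these three values the mean is 4 and the sum of squares is 8, so the variance estimate is 8/6.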

Because of the simplicity of expression (3.8), it has been suggested that it be used as an 
estimator of the unconditional variance. It has also been suggested that the variance of the 
estimated total state production be estimated with 


$$\hat{V}_s(\hat{Y}) = \hat{A}^2 \hat{V}_2(\bar{y}) + \bar{y}^2 \hat{V}(\hat{A}) + \hat{V}(\hat{A}) \hat{V}_2(\bar{y}), \qquad (3.9)$$

where $\hat{A}$ is defined in (3.6) and $\hat{V}(\hat{A})$ is the usual variance estimator for a stratified estimated 
total, 

$$\hat{V}(\hat{A}) = \sum_{h=1}^{L} N_h^2 n_h^{-1} (n_h - 1)^{-1} \sum_{k=1}^{n_h} (A_{hk} - \bar{A}_h)^2, \qquad (3.10)$$

and 

$$\bar{A}_h = n_h^{-1} \sum_{k=1}^{n_h} A_{hk}.$$

The estimator (3.9) is an estimator of the variance of a product based on an implicit 
assumption that $\bar{y}$ and $\hat{A}$ are uncorrelated. 
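
A hedged sketch of the two pieces may help: the stratified variance estimator for the acreage total, and the product-variance formula that treats yield and acreage as uncorrelated. The sample, the yield figures and the function names are hypothetical, and the product formula is written with all three terms added, as the expression reads in the text.

```python
def acreage_variance(strata):
    """Variance estimator for the stratified expansion total of acreage:
    sum over strata of N_h^2 / (n_h * (n_h - 1)) * sum_k (A_hk - A_bar_h)^2."""
    total = 0.0
    for N_h, acres in strata:
        n_h = len(acres)
        mean_h = sum(acres) / n_h
        ss_h = sum((a - mean_h) ** 2 for a in acres)
        total += N_h ** 2 / (n_h * (n_h - 1)) * ss_h
    return total

def product_variance(A_hat, v_A, y_bar, v_y):
    """Variance estimator for y_bar * A_hat under the implicit assumption
    that the two factors are uncorrelated."""
    return A_hat ** 2 * v_y + y_bar ** 2 * v_A + v_A * v_y

# Hypothetical sample: (population segments, sampled segment acreages).
strata = [(100, [10.0, 30.0]), (50, [0.0, 0.0, 20.0, 0.0, 5.0])]
v_A = acreage_variance(strata)
print(v_A)  # 1037500.0
print(product_variance(A_hat=2250.0, v_A=v_A, y_bar=2.0, v_y=0.01))
```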

Evaluation of the extent to which the estimator (3.9) tends to underestimate the variance 
of $\hat{Y}$ is difficult. We can express the unconditional variance of $\bar{y}$ as 

$$V(\bar{y}) = V_1\{E_2(\bar{y})\} + E_1\{V_2(\bar{y})\}$$
$$= V_1\left\{ \left( \sum_{h=1}^{L} \sum_{k=1}^{n_h} N_h n_h^{-1} M_{hk} \right)^{-1} \sum_{h=1}^{L} \sum_{k=1}^{n_h} N_h n_h^{-1} Y_{hk\cdot} \right\} + E_1\{V_2(\bar{y})\}, \qquad (3.11)$$

where $Y_{hk\cdot} = M_{hk} \bar{y}_{hk\cdot}$ is the total for the k-th segment in stratum h, and $E_1$ and $V_1$ denote 
the expectation and variance, respectively, with respect to first phase sampling. 

The estimator $\hat{V}_2(\bar{y})$ is unbiased for the second component of expression (3.11) under 
simple random sampling of secondary units. Because sampling at phase two of the NASS 
scheme is done systematically, $\hat{V}_2(\bar{y})$ is a biased estimator of $V_2(\bar{y})$. The nature and extent 
of this bias depends upon the correlation structure of the list used in sample selection at 
the second phase. Also affecting the bias in $\hat{V}_2(\bar{y})$ as an estimator of the true variance is 
the fact that formula (3.8) was derived under an assumption of replacement sampling at phase 
two. To the extent that phase two sampling is actually done without replacement (because 
samples are drawn systematically from the list of expanded segment acreages, a segment is 
sampled more than once only if it is large), $\hat{V}_2(\bar{y})$ will overestimate $V_2(\bar{y})$. 

The estimator $\hat{V}_s(\hat{Y})$ contains no estimator of $A^2 V_1\{E_2(\bar{y})\}$, and this produces a 
negative bias. However, estimation of that component is not easy, even under the simplifying 
assumption of probability-proportional-to-size sampling at phase two. Because of these 
considerations, the performance of $\hat{V}_s(\hat{Y})$ will be studied by Monte Carlo methods in 
Section 3.4. 


3.3 Alternative Estimators of Variance 


An alternative approach to the estimation of $V(\bar{y})$ is to view the sample as a two phase 
sample, as shown in Table 1, and to assume that the unconditional probability of selecting 
a segment to receive a secondary unit is proportional to the conditional probability given 
the first phase segments. 

Let $\pi_{hk}$ be the conditional probability that segment k in stratum h is included in the 
second phase, given the first phase sample of segments. We have 

$$\pi_{hk} = \min(1, M_{hk} \pi_{hk\ell}), \qquad (3.12)$$

where $\pi_{hk\ell}$ is a constant within segment hk. If $\pi_{hk} = 1$ and the segment is selected to receive 
more than one secondary unit, it is assumed that the secondary units are independently drawn. 

Let $\pi^*_{hk}$ be the unconditional probability that an observation is made on segment k in 
stratum h at phase two. If $A_{hk} > 0$, then $\pi^*_{hk}$ is the unconditional probability that segment 
hk is selected to receive at least one secondary unit. If $A_{hk} = 0$, then $\pi^*_{hk}$ is equal to the 
probability that segment hk is selected at the first phase of sampling. Let 

$$\pi^*_{hk} = \begin{cases} n_h N_h^{-1} & \text{if } A_{hk} = 0, \\ \pi_{hk} \, n_h N_h^{-1} & \text{if } A_{hk} > 0, \end{cases} \qquad (3.13)$$

where $\pi_{hk}$, defined in (3.12), is the conditional probability that the hk-th segment is selected 
in phase two, given the first phase sample. 

In our analysis we assume the $\pi^*_{hk}$ to be fixed. This will be so and the probability $\pi^*_{hk}$ 
will be the true unconditional probability if $\pi_{hk}$ is a specified multiple of $M_{hk}$ where the 
multiple is fixed before sample selection. Expression (3.13) will be an approximation if $\pi_{hk}$ 
is a function of the segments selected at the first step of the selection procedure. 

Expression (3.13) is proportional to $M_{hk}$ for $M_{hk} \pi_{hk\ell} < 1$. If $M_{hk} \pi_{hk\ell} > 1$, then the 
number of selected secondary units is greater than or equal to one. The correct number of 
secondary units to allocate to such segments to maintain a self-weighting sample of secondary 
units is $M_{hk} \pi_{hk\ell}$. In practice, the number of secondary units observed as a result of 
probability-proportional-to-size systematic sampling never differs from $M_{hk} \pi_{hk\ell}$ by more 
than one. 

To simplify the remaining computations, we assume that the systematic sampling design 
contains no rounding error. In other words it is assumed that the number of secondary units 
observed per segment is equal to the number required for a self-weighting sample. Thus, 
it is assumed that the number of secondary units observed in a segment drawn as part of 
the second phase of sampling is 


$$m_{hk} = 1 \quad \text{if } \pi_{hk} < 1,$$
$$m_{hk} = M_{hk} \pi_{hk\ell} \quad \text{if } \pi_{hk} = 1. \qquad (3.14)$$


Under this assumption, an unequal probability combined ratio estimator of the mean yield 
is equivalent to estimator (3.1). The combined ratio estimator is 

$$\bar{y}_r = \hat{M}_r^{-1} \sum_{h=1}^{L} \sum_{k=1}^{K_h} (\pi^*_{hk})^{-1} M_{hk} \bar{y}_{hk\cdot}, \qquad (3.15)$$

where 

$$\bar{y}_{hk\cdot} = m_{hk}^{-1} \sum_{\ell=1}^{m_{hk}} y_{hk\ell} \quad \text{if } A_{hk} > 0; \qquad \bar{y}_{hk\cdot} = 0 \quad \text{if } A_{hk} = 0,$$

$$\hat{M}_r = \sum_{h=1}^{L} \sum_{k=1}^{K_h} (\pi^*_{hk})^{-1} M_{hk}.$$

In expression (3.15) and the remaining expressions of this section, the reader can read $\hat{A}_r$ 
(total area) for $\hat{M}_r$ (total secondary units), if so desired. 
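
The claimed equivalence can be checked numerically. The sketch below rests on several assumptions: it reads the conditional unit probability as D times the stratum expansion factor divided by the expanded total of secondary units, takes the unconditional segment probability for segments with crop as the conditional probability times $n_h / N_h$, and uses the no-rounding take for certainty segments. The data, the `observed` flags and the function name are hypothetical.

```python
def compare_estimators(strata, D):
    """Return (self-weighted mean, combined ratio estimate) over observed segments.

    strata -- list of (N_h, segments); each segment is (M_hk, ybar_hk, observed)
              and the list holds all n_h first-phase segments of the stratum
    D      -- total number of secondary units to be selected
    """
    # Expanded total of secondary units over the first-phase sample.
    T = sum(N_h / len(segs) * sum(M for M, _, _ in segs) for N_h, segs in strata)
    num_simple = den_simple = num_ratio = den_ratio = 0.0
    for N_h, segs in strata:
        n_h = len(segs)
        pi_unit = D * (N_h / n_h) / T                 # conditional unit probability
        for M_hk, ybar_hk, observed in segs:
            if not observed or M_hk == 0:
                continue                              # zero-acre segments add nothing to either sum
            pi_hk = min(1.0, M_hk * pi_unit)          # conditional segment probability
            pi_star = pi_hk * n_h / N_h               # unconditional segment probability
            m_hk = 1.0 if pi_hk < 1.0 else M_hk * pi_unit   # take, without rounding
            num_simple += m_hk * ybar_hk
            den_simple += m_hk
            num_ratio += M_hk * ybar_hk / pi_star
            den_ratio += M_hk / pi_star
    return num_simple / den_simple, num_ratio / den_ratio

strata = [
    (100, [(10, 50.0, True), (20, 70.0, False), (30, 60.0, True), (140, 80.0, True)]),
    (60, [(0, 0.0, True), (40, 90.0, True), (25, 40.0, False)]),
]
simple, ratio = compare_estimators(strata, D=4)
print(simple, ratio)
```

Under these assumptions the ratio-estimator weight of each observed segment is a constant multiple of its take, so the two estimates agree up to floating-point error.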

In the following discussion, replacement sampling of segments with probabilities proportional 
to the area of a crop within the segment is assumed as an approximation to the 
probability-proportional-to-size systematic sampling scheme of the second phase. An estimator 
of the variance of $\bar{y}_r$ under the assumption of replacement sampling is 

$$\hat{V}\{\bar{y}_r\} = \hat{M}_r^{-2} \sum_{h=1}^{L} K_h (K_h - 1)^{-1} \sum_{k=1}^{K_h} \left( (\pi^*_{hk})^{-1} u_{hk} - \bar{u}_{h\cdot} \right)^2, \qquad (3.16)$$

where 

$$u_{hk} = M_{hk} (\bar{y}_{hk\cdot} - \bar{y}_r),$$

$$\bar{u}_{h\cdot} = K_h^{-1} \sum_{k=1}^{K_h} (\pi^*_{hk})^{-1} u_{hk}.$$

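An estimator of the total production is 

$$\hat{Y}_r = N \bar{M}_n \bar{y}_r, \qquad (3.17)$$

where 

$$\bar{M}_n = \sum_{h=1}^{L} W_h n_h^{-1} \sum_{k=1}^{n_h} M_{hk},$$

N is the total number of segments in the population and $W_h = N^{-1} N_h$. The Taylor 
approximation of the unconditional variance of the approximate distribution of $\hat{Y}_r$ is 

$$V\{\hat{Y}_r\} = N^2 \left[ \bar{M}_N^2 V\{\bar{y}_r\} + 2 \bar{M}_N \bar{y}_N C\{\bar{y}_r, \bar{M}_n\} + \bar{y}_N^2 V\{\bar{M}_n\} \right], \qquad (3.18)$$

where $\bar{y}_r$ is given in (3.15), $\bar{M}_n$ is defined in (3.17), 

$$\bar{M}_N = N^{-1} \sum_{h=1}^{L} \sum_{k=1}^{N_h} M_{hk},$$

$$\bar{y}_N = \left( \sum_{h=1}^{L} \sum_{k=1}^{N_h} M_{hk} \right)^{-1} \sum_{h=1}^{L} \sum_{k=1}^{N_h} Y_{hk\cdot},$$

$Y_{hk\cdot} = M_{hk} \bar{y}_{hk\cdot}$ is the total for the k-th segment in stratum h, and $C\{\bar{y}_r, \bar{M}_n\}$ is the 
covariance between $\bar{y}_r$ and $\bar{M}_n$. 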
Under the unequal-probability-fixed-take procedure, the estimator $\bar{y}_r$ ($\doteq \bar{y}$) is 
approximately conditionally unbiased for the mean yield for the $n = \sum_h n_h$ segments in the first phase 
sample. The mean yield of the n segments is 

$$\bar{M}_n^{-1} \bar{Y}_n.$$

Therefore, the covariance between $\bar{y}_r$ and $\bar{M}_n$ is the covariance between $\bar{M}_n^{-1} \bar{Y}_n$ and $\bar{M}_n$, 
where 

$$\bar{Y}_n = \sum_{h=1}^{L} W_h n_h^{-1} \sum_{k=1}^{n_h} Y_{hk\cdot}.$$

Using the common approximation for a ratio, the covariance between $\bar{y}_r$ and $\bar{M}_n$ can be 
approximated by 

$$C\{\bar{y}_r, \bar{M}_n\} \doteq \bar{M}_N^{-1} \left( C\{\bar{Y}_n, \bar{M}_n\} - \bar{y}_N V\{\bar{M}_n\} \right). \qquad (3.19)$$

If the probability of observing the pair $(Y_{hk\cdot}, M_{hk})$ is proportional to $\pi^*_{hk}$, an estimator of 
the covariance between $\bar{Y}_n$ and $\bar{M}_n$ is 

$$\hat{C}\{\bar{Y}_n, \bar{M}_n\} = \sum_{h=1}^{L} W_h^2 n_h^{-1} S_{hYM}, \qquad (3.20)$$

where 
The estimator $S_{hYM}$ is constructed as a degrees-of-freedom adjustment to a Horvitz-Thompson 
ratio estimator of the mean of the products $(M_{hk} - \bar{M}_{h\cdot})(Y_{hk\cdot} - \bar{Y}_{h\cdot\cdot})$. The 
degrees-of-freedom adjustment, the factor $K_h (K_h - 1)^{-1}$, is introduced because it is 
necessary to replace the population means with sample means when constructing the product. 
Substituting (3.15), (3.16), and (3.20) into (3.18) gives 

$$\hat{V}\{\hat{Y}_r\} = N^2 \left[ \bar{M}_n^2 \hat{V}\{\bar{y}_r\} + 2 \bar{M}_n \bar{y}_r \hat{C}\{\bar{y}_r, \bar{M}_n\} + \bar{y}_r^2 \hat{V}\{\bar{M}_n\} \right], \qquad (3.21)$$

where $\hat{V}\{\bar{M}_n\}$ is the variance estimator for a stratified mean. Equation (3.21) is a stratified 
double sampling estimator of the variance of the estimated total state production. Unlike 
the estimator $\hat{V}_s(\hat{Y})$ of (3.9), estimator (3.21) does not assume that the yield and acreage 
estimators are uncorrelated. Equation (3.21) also uses an unconditional estimator of the 
variance of yield. 


3.4. A Monte Carlo Comparison of Estimators 


A Monte Carlo study was performed to illustrate the differences among alternative 
estimators. Cotton acreage data from the 1983 June Enumerative Survey in California and 
data from the corresponding 1983 objective yield survey were used as a basis for the study. 
For purposes of the Monte Carlo study, 28 strata were considered to have cotton. 

Table 2 shows the distribution of cotton among the 28 strata as observed in the 1983 June 
Enumerative Survey. Fecso and Johnson (1981) describe the six different land uses, where 
land use is the first two digits of the stratum identification, as follows: 


1300 - 50% or more cultivated land, primarily general crops with less than or equal to 
10% fruit or vegetables; 

1700 - 50% or more cultivated land, primarily fruit, tree nuts, or grapes mixed with general 
crops; 

1900 - 50% or more cultivated land, primarily vegetables mixed with general crops; 

2000 - 15-50% cultivated land with extensive cropland and hay; 

3100 - residential mixed with agricultural lands, more than 20 dwellings per square mile; 

4100 - less than 15% cultivated land, primarily privately owned rangeland. 


A population was simulated from the results of the 1983 June Enumerative Survey. Table 2 
compares the characteristics of the simulated population to the results of the survey. In the 
simulated population, cotton was determined to be present in segment k ($k = 1, \ldots, N_h$) 
within stratum h ($h = 1, \ldots, 28$) if $X_{hk} = 1$, where $X_{hk}$ is an independent Bernoulli($p_h$) 
random variable and $p_h$ is the observed proportion of segments in stratum h found to have 
cotton in the 1983 June Enumerative Survey. 

The next step in the creation of the population was the assignment of cotton acres to the 
segments for which $X_{hk} = 1$. A set of 1983 observed ratios of segment cotton acreages to 
the average segment acreage was compiled for land use substrata having more than one 
segment with cotton in the 1983 June Enumerative Survey. This set of observed ratios was used 
to generate the number of cotton acres in segments having cotton. If $X_{hk} = 1$, then a ratio, 
$r_{hk}$, was drawn from the set of observed ratios such that each observed ratio in the set had 
an equal probability of selection. The number of acres of cotton in segment hk, $M_{hk}$, was 
defined by 

$$M_{hk} = r_{hk} \bar{M}_h, \qquad (4.1)$$

where $\bar{M}_h$ was the observed average number of cotton acres for segments with cotton in 
stratum h in the 1983 June Enumerative Survey. (See Table 2.) 
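
The two-step construction of segment acreages can be sketched as follows. The stratum figures are those of stratum 1314 in Table 2; the pool of ratios, the seed and the function name are hypothetical, since the observed ratios themselves are not reported.

```python
import random

def simulate_stratum(N_h, p_h, mean_acres_h, ratio_pool, rng):
    """Simulate cotton acreage for the N_h segments of one stratum.

    Presence of cotton is Bernoulli(p_h); the acreage of a segment with
    cotton is a ratio drawn with equal probability from ratio_pool,
    multiplied by the stratum mean acreage mean_acres_h.
    """
    acres = []
    for _ in range(N_h):
        has_cotton = rng.random() < p_h
        acres.append(rng.choice(ratio_pool) * mean_acres_h if has_cotton else 0.0)
    return acres

rng = random.Random(1983)                 # fixed seed for reproducibility
ratio_pool = [0.4, 0.8, 1.0, 1.3, 1.5]    # hypothetical observed ratios
acres = simulate_stratum(N_h=291, p_h=0.60, mean_acres_h=197.0,
                         ratio_pool=ratio_pool, rng=rng)
share = sum(a > 0 for a in acres) / len(acres)
print(round(share, 2))                    # simulated proportion of segments with cotton
```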

Results of the 1983 objective yield survey for cotton were used to simulate yield observations 
within segments. Since estimated yields were not readily accessible, an alternative variable, 
a major component of yield estimates, was used. This variable is the number of plants 
per 100 square feet. The estimated overall population mean number of plants per 100 square 
feet was 79.6 for the 1983 objective yield survey. Table 3 shows the average number of plants 
Table 2 

Cotton Acreage Estimates from the 1983 June Enumerative Survey 
in California and Cotton Acreages in the Simulated Population 

          Target     Number of    Number of    Percentage of Segments    Mean Acres Cotton in
          Segment    Segments     Segments     with Cotton               Segments with Cotton
          Size       in           Sampled      ----------------------    ----------------------
Stratum   (Acres)    Stratum      in 1983      1983      Simulated       1983      Simulated
                                                         Population                Population
----------------------------------------------------------------------------------------------
1314        640        291          10          60         60            197        200
1315        640        291          10         100        100            354        348
1316        640        291          10          90         89            167        173
1317        640        291          10          90         92            149        148
1318        640        291          10          50         53            481        422
1319        640        291          10          20         19            249¹       260
1320        640        291          10          90         91            154        55)
1321        640        291          10          60         61            270        274
1322        640        291          10          70         71            205        210
1323        640        291          10          80         79            288        279
1713        320        432          10          30         28            125        12
1714        320        432          10          30         31             58        5i7/
1715        320        432          10          20         22             86²        84
1716        320        432          10          10          8             86²        89
1717        320        432          10          40         38             26         27
1718        320        432          10          30         29            144        144
1719        320        432          10          30         31             65         67
1720        320        432          10          30         30             38         35
1721        320        432          10          30         29            133        138
1722        320        432          10          50         47            130        131
1723        320        432          10          40         40             76         76
1906        640        362          10          70         73            117        2H
1907        640        362          10          70         74            192        194
1908        640        362          10          80         83            253        246
2010        640        649          10          30         31            303        306
2011        640        649          10          40         41            175        165
3107        160      1,847           5          20         22            ay!         25
4110      2,560      1,044          10          10         10            178        165

¹ Number of segments sampled was less than or equal to 2. Average of all segments in substrata within land use 
stratum 13 is shown. 

² Number of segments sampled was less than or equal to 2. Average of all segments in substrata within land use 
stratum 17 is shown. 

³ Number of segments sampled was less than or equal to 2. Approximate acreages for this agri-urban stratum are 
shown. 


per 100 square feet by stratum for the 1983 survey. The average for each stratum is based 
on all secondary units within the stratum that were drawn as part of the probability- 
proportional-to-estimated-size sampling scheme. 

An analysis of variance of the 1983 plant data (Table 4) shows that 28 percent of the total 
variation among secondary units was due to between-segment differences within strata 
($s_b^2 = 378.0$), whereas 58 percent of the total variation was due to variation among 
secondary units within segments ($s_w^2 = 776.6$). If the stratum component is treated as fixed, 67 
percent of the within-stratum variation is due to variance among secondary units. 
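
These percentages follow directly from the variance components reported in Table 4, as the short check below illustrates. The variable names are ours; the three component values are taken from the table.

```python
components = {
    "stratum": 187.3,
    "segment within stratum": 378.0,
    "residual": 776.6,
}
total = sum(components.values())                       # total of the three components
shares = {name: 100 * value / total for name, value in components.items()}
# Share due to secondary units once the stratum component is set aside.
within_stratum_share = 100 * 776.6 / (378.0 + 776.6)

for name, pct in shares.items():
    print(f"{name}: {pct:.0f}%")
print(f"secondary units within strata: {within_stratum_share:.0f}%")
```

The shares round to 14, 28 and 58 percent, and the secondary-unit share of the variation within strata rounds to 67 percent.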
Table 3 

Average Number of Plants per 100 Square Feet from the 1983 
Objective Yield Survey for Cotton in California and in the 
Simulated Population 

              Average Number of Plants per 100 Square Feet
              --------------------------------------------
Stratum       1983 Objective          Simulated
              Yield Survey            Population
----------------------------------------------------------
1314               78                    76
1315               80                    80
1316               67                    68
1317               72                    73
1318               80                    80
1319               93                    93
1320               92                    91
1321               70                    69
1322               84                    84
1323               72                    71
1713              Tis                   117
1714               96¹                   95
1715               96¹                   93
1716               96¹                   86
1717               96¹                   96
1718              139                   140
1719               96¹                   97
1720               96¹                   97
1721               89                    86
1722               79                    79
1723               84                    85
1906               98                    98
1907               67                    67
1908               53                    53
2010              118                   118
2011               47                    47
3107               80²                   79
4110               60                    59

¹ Number of secondary units observed was less than or equal to 2. Secondary unit average for 
land use stratum 17 is shown. 
² Number of secondary units observed was less than or equal to 2. Secondary unit average for 
all strata is shown. 


Table 4 
Analysis of Variance for the 1983 Objective Yield Survey Data 

Source              Degrees of    Sum of       Mean        Variance     Percent
                    Freedom       Squares      Square      Component    of total
---------------------------------------------------------------------------------
Stratum                 26         80,193      3,084.3       187.3         14
Segment within
  Stratum               85        124,086      1,459.8       378.0         28
Residual               103         79,991        776.6       776.6         58
Total                  214        284,270                  1,341.9        100
When a segment had cotton, the mean number of plants per 100 square feet for segment 
hk was simulated by 

$$c_{hk} = \bar{c}_h + e_{hk}, \qquad (4.2)$$

where $\bar{c}_h$ is the average number of plants per 100 square feet for stratum h, $e_{hk}$ is distributed 
$N(0, s_b^2)$, and $s_b^2 = 378.0$. In the event that the simulated segment mean ($c_{hk}$) was less than 
10% of the stratum mean, then $c_{hk}$ was set equal to $(.10)\bar{c}_h$. Table 3 compares the simulated 
stratum means with those from the 1983 objective yield survey. The overall mean in the 
simulated population was $\bar{y}_N = 79.6$. 

From the simulated population 500 June Enumerative Survey samples were drawn using 
stratified random sampling. A total of 275 segments were drawn for each of the simulated 
samples. The number of segments drawn from each stratum was the same as that for the 
1983 June Enumerative Survey (see Table 2). For each of the simulated samples, estimates 
of the mean number of acres per segment in the population, as well as the conditional 
probabilities $\pi_{hk}$, from (3.12), that the segments in the sample would receive plots in a draw, 
were calculated. These conditional probabilities were used at the second stage of sampling 
in the single start probability-proportional-to-estimated-size systematic sampling described 
in Section 2. Objective yield survey samples were simulated by selecting 220 secondary units 
using this systematic sampling scheme. Two objective yield survey samples were simulated 
for each of the 500 simulated June Enumerative Survey samples. 

When a segment was selected to receive a secondary unit, the yield (number of plants per 
100 square feet) observed within a field was simulated under the assumption that the 
coefficient of variation within each segment was constant. The observed number of plants was 
defined as 

$$y_{hk\ell} = c_{hk} + s_w \bar{y}_N^{-1} c_{hk} f_{hk\ell}, \qquad (4.3)$$

where $y_{hk\ell}$ is the estimated average number of plants per 100 square feet for the $\ell$-th 
secondary unit in segment k of stratum h, and $f_{hk\ell}$ is distributed $N(0, 1)$. The within-segment 
standard error is the square root of the $s_w^2 = 776.6$ of Table 4, and $\bar{y}_N$ is the overall mean 
number of plants per plot. In the event that $y_{hk\ell}$ was less than 10% of the stratum mean, 
then $y_{hk\ell}$ was set equal to $(.10)\bar{c}_h$. Similarly, if $y_{hk\ell}$ was greater than 190% of the stratum 
mean, then $y_{hk\ell}$ was set equal to $(1.9)\bar{c}_h$. 
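
The two simulation steps can be sketched together. This is an illustration under stated assumptions rather than the authors' program: the within-segment noise is scaled by the segment mean times the ratio of the within-segment standard deviation to the overall mean, which is one reading of the constant coefficient of variation assumption, and the seed and function names are hypothetical.

```python
import random

S_B = 378.0 ** 0.5     # between-segment standard deviation (Table 4)
S_W = 776.6 ** 0.5     # within-segment standard deviation (Table 4)
Y_BAR_N = 79.6         # overall mean plants per 100 square feet

def simulate_segment_mean(stratum_mean, rng):
    """Segment mean: stratum mean plus N(0, s_b^2) noise, floored at 10%
    of the stratum mean."""
    c_hk = stratum_mean + rng.gauss(0.0, S_B)
    return max(c_hk, 0.10 * stratum_mean)

def simulate_unit(c_hk, stratum_mean, rng):
    """Secondary-unit value with an assumed constant coefficient of variation
    S_W / Y_BAR_N, truncated to [10%, 190%] of the stratum mean."""
    y = c_hk + (S_W / Y_BAR_N) * c_hk * rng.gauss(0.0, 1.0)
    return min(max(y, 0.10 * stratum_mean), 1.90 * stratum_mean)

rng = random.Random(0)                    # fixed seed for reproducibility
stratum_mean = 78.0                       # stratum 1314 in Table 3
c_hk = simulate_segment_mean(stratum_mean, rng)
units = [simulate_unit(c_hk, stratum_mean, rng) for _ in range(4)]
print(c_hk, units)
```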

Results of the simulations for cotton acreages are summarized in Table 5. The estimated 
mean acres per segment is 


$$\bar{A}_n = \sum_{h=1}^{L} W_h n_h^{-1} \sum_{k=1}^{n_h} A_{hk}, \qquad (4.4)$$

with estimated variance 

$$\hat{V}(\bar{A}_n) = \sum_{h=1}^{L} W_h^2 n_h^{-1} (n_h - 1)^{-1} \sum_{k=1}^{n_h} (A_{hk} - \bar{A}_h)^2. \qquad (4.5)$$
Table 5 

Estimated Cotton Acreages from 500 Simulated 
June Enumerative Survey Samples 

              $\bar{A}_n$         $\hat{V}(\bar{A}_n)$
-------------------------------------------------------
Average       9.93                0.64
Range         8.13 - 12.21
Variance      0.66                0.016


The average cotton acres per segment in the simulated population was 9.94, while the average 
of the 500 sample estimates was 9.93. The actual variance of the stratified estimator $\bar{A}_n$ was 
0.63, while the average estimated variance for the 500 simulated samples was 0.64. Because 
the variation in estimated cotton acreage is small, $\pi^*_{hk}$ provides a stable estimate of the 
unconditional probability that segment k in stratum h is selected to receive at least one 
secondary unit. 

In addition to the estimators discussed previously, random group estimators of the variance
were constructed. Two sets of random groups were formed for each objective yield survey
sample. One set contained five groups ($\gamma = 5$) and one set contained ten groups ($\gamma = 10$).
Random groups were created by dividing the primary sampling units, the segments, into
subsets within each land use substratum. The first group in each set of groups was obtained
by drawing a simple random sample without replacement of size $K_{h(\alpha)} = n_h/\gamma$ from the
sample of segments selected from each stratum ($h = 1, \ldots, 28$) of the parent June
Enumerative Survey sample. The second random group was obtained in the same fashion
by selecting $K_{h(\alpha)}$ segments from the remaining $n_h - K_{h(\alpha)}$ segments in each stratum. The
remaining random groups were formed in a like manner. One land use substratum, stratum
number 3107, had a sample size of $n_h = 5$ segments. Acreage and yield values of the
observed five segments were repeated to form the ten observations required to create ten
groups when $\gamma = 10$.
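
The formation of random groups can be sketched as below. Drawing successive simple random samples without replacement from the remaining segments is equivalent to randomly partitioning the segments, which is what the sketch does. The function name, the stratum labels, and the rule of tiling the observed segments when a stratum has fewer segments than groups are assumptions made for illustration.

```python
import numpy as np

def form_random_groups(n_by_stratum, n_groups, rng):
    """Return one dict per random group mapping stratum -> segment indices (sketch)."""
    groups = [dict() for _ in range(n_groups)]
    for h, n_h in n_by_stratum.items():
        idx = np.arange(n_h)
        if n_h < n_groups:
            # Too few segments: repeat the observed segments until every group can be filled.
            reps = -(-n_groups // n_h)  # ceiling division
            idx = np.tile(idx, reps)
        shuffled = rng.permutation(idx)
        size = len(shuffled) // n_groups  # segments per group in this stratum
        for a in range(n_groups):
            groups[a][h] = shuffled[a * size:(a + 1) * size]
    return groups

rng = np.random.default_rng(0)
# Hypothetical strata: one with 20 sampled segments, one with only 5.
groups = form_random_groups({"A": 20, "B": 5}, 10, rng)
```

With ten groups, the five-segment stratum contributes each observed segment to two different groups, mirroring the repetition described in the text.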

Let $D_\alpha$ be the number of secondary units with positive acres which were selected during
the objective yield survey in random group $\alpha$, where $\alpha = 1, \ldots, \gamma$. Let $\bar{y}_{(\alpha)}$ denote the yield
estimator obtained from the $\alpha$-th random group:

$$\bar{y}_{(\alpha)} = D_\alpha^{-1} \sum_{t=1}^{D_\alpha} y_{t(\alpha)}, \qquad (4.6)$$

where $y_{t(\alpha)}$ is the analogue of equation (3.3) for the $\alpha$-th group. The random group estimator
of the variance of $\bar{y}$ is then given by

$$V_{r\gamma}(\bar{y}) = (\gamma - 1)^{-1} \sum_{\alpha=1}^{\gamma} (\bar{y}_{(\alpha)} - \bar{y})^2. \qquad (4.7)$$

This estimator is slightly biased for the ten group estimator because one stratum contained
only five observations, and these observations were repeated in the groups.

Similarly, let $\hat{Y}_{(\alpha)}$ denote the total production estimator obtained from the $\alpha$-th random
group:

$$\hat{Y}_{(\alpha)} = N \bar{M}_{(\alpha)} \bar{y}_{(\alpha)}, \qquad (4.8)$$

where

$$\bar{M}_{(\alpha)} = \sum_{h=1}^{L} W_h K_{h(\alpha)}^{-1} \sum_{k=1}^{K_{h(\alpha)}} M_{hk(\alpha)},$$

$M_{hk(\alpha)}$ is the number of acres of cotton in segment $k$ of stratum $h$ for random group $\alpha$ and
$K_{h(\alpha)}$ is the number of segments in stratum $h$ for the $\alpha$-th group. The random group
estimator of the variance of $\hat{Y}$ is then given by

$$V_{r\gamma}(\hat{Y}) = (\gamma - 1)^{-1} \sum_{\alpha=1}^{\gamma} (\hat{Y}_{(\alpha)} - \hat{Y})^2. \qquad (4.9)$$


Tables 6 and 7 summarize the results of the Monte Carlo study for yield and production
estimators. Average values of the estimates and their variance estimates across the 1,000
simulated objective yield survey samples are shown in the tables. Simulation of two objective
yield survey samples for each June Enumerative Survey sample made it possible to estimate
the between- and within-June Enumerative Survey variance components.

The estimator (3.1) currently used, $\bar{y}$, and the combined ratio estimator (3.15), $\bar{y}_\pi$, which
is based on the $\hat{\pi}_{hk}$ calculated from June Enumerative Survey results, provide estimates with
similar accuracy (see Table 6). The equal efficiency is partly due to the accuracy with which
the unconditional selection probabilities are estimated in each sample.

As was shown in Section 3.2, the conditional variance $V_2(\bar{y})$ is an underestimate of $V(\bar{y})$.
For this simulated population, $V_2(\bar{y})$ underestimated the observed variance of $\bar{y}$ by 38%.
The observed variance of $\bar{y}$ was 11.57 as compared to an average of 7.21 for $V_2(\bar{y})$. This
underestimation of the variance was consistent across samples. The estimated variance of
$V_2(\bar{y})$ was 0.99, with $V_2(\bar{y})$ ranging from a low of 3.85 to a high of 11.24 in the 1,000
observations. Thus, the maximum observed estimate of the conditional variance was less
than the true variance.


Table 6

Monte Carlo Properties of Yield per Acre Estimates
and Estimated Variances¹

                                          Estimator
                  $\bar{y}$     $V_2(\bar{y})$   $V_{r5}(\bar{y})$   $V_{r10}(\bar{y})$   $\bar{y}_\pi$    $V(\bar{y}_\pi)$
Average           79.74     7.21      12.62       12.39      79.76     12.39
Total Variance    11.57     0.99      74.58       36.86      11.57     12.51
Between JES        7.60     0.48       6.10        4.56       7.64      7.61
Within JES         3.97     0.51      68.48       32.30       3.93      4.90

¹ Two objective yield survey samples were simulated from each of 500 simulated June Enumerative Survey samples.

Table 7

Monte Carlo Properties of Production Estimates
and Estimated Variances¹

                                          Estimator²
                  $\hat{Y}$     $V_1(\hat{Y})$   $V_{r5}(\hat{Y})$   $V_{r10}(\hat{Y})$   $\hat{Y}_\pi$    $V(\hat{Y}_\pi)$
Average           73.04    40.85      48.99       48.53      73.0?     48.73
Total Variance    49.69    82.52    1245.10      608.80      49.58    222.96
Between JES       46.35    78.17      50.82      208.48      46.30    199.58
Within JES         3.34     4.35    1194.28      400.32       3.28     23.38

¹ Two objective yield survey samples were simulated from each of 500 simulated June Enumerative Survey samples.
There were N = 92,240 segments in the simulated population.
² The estimator $\hat{Y}$ is in millions of units and variances are in the corresponding units.


Assuming probability-proportional-to-size sampling with replacement of segments at the
second phase, $V_2(\bar{y})$ was shown in Section 3.2 to be unbiased for the variance of $\bar{y}$
conditional on the sample of segments selected at the first stage of sampling. The estimate of the
expected value of the conditional variance of $\bar{y}$, $V_2(\bar{y})$, from the Monte Carlo study is 3.97.
This large discrepancy (3.97 versus 7.21) can be attributed to the fact that the estimator
$V_2(\bar{y})$ ignores the effects of stratification in the population (see Tables 2 and 3) and to the
fact that $V_2(\bar{y})$ was derived under the assumption that segments are selected with
replacement at the second stage of sampling.

The estimator (3.9), $V_1(\hat{Y})$, underestimates the unconditional variance of $\hat{Y}$. While the
observed variance of $\hat{Y}$ from the Monte Carlo simulations is 49.69 (million)², the average
of the $V_1(\hat{Y})$ is only 40.85 (million)². This 18% underestimate of the true variance occurs
for a number of reasons. As was shown previously, there is a negative bias in $V_2(\bar{y})$ as an
estimator of $V(\bar{y})$; another important factor contributing to the bias is the failure of $V_1(\hat{Y})$
to take into account the covariance between $\bar{M}$ and $\bar{y}$. In this example, the bias caused by
omitting the covariance term partially balances the bias associated with $V(\bar{y})$.

Using expression (3.16), $V(\bar{y}_\pi)$, as an estimator of the variance of $\bar{y}_\pi$, and expression
(3.21), $V(\hat{Y}_\pi)$, as an estimator of the variance of $\hat{Y}_\pi$, provided results which are much more
satisfactory than those of the estimators currently used. The Monte Carlo average of the
estimates $V(\bar{y}_\pi)$ was 12.51, which overestimates the observed variance of $\bar{y}_\pi$ (11.57) by about
7%. About one-third of the overestimate (2-4%) can be attributed to the use of sampling
without replacement at the first two stages of sampling. The remaining difference of about
4% is small relative to the standard error of the estimated difference. The variance of the
difference was estimated by estimating the variance of the mean of $z_{tj}$, where


Zp Sari aoe 10)” = Oni) » (4.10) 


for the $j$-th yield sample ($j = 1, 2$) within June Enumerative Survey sample
$t$ ($t = 1, \ldots, 500$). The estimated standard error of the difference was 0.58. Thus, the average
value of $V(\bar{y}_\pi)$ is within 1.5 standard errors of the estimated variance of $\bar{y}_\pi$. The average
estimated variance of $\hat{Y}_\pi$ is within 2 percent of the variance observed in the Monte Carlo
simulations.

Random group estimators of the variance of $\bar{y}$ displayed little bias. The Monte Carlo
averages of estimators $V_{r5}(\bar{y})$ and $V_{r10}(\bar{y})$ were 9% and 7%, respectively, larger than
the corresponding Monte Carlo variances. These differences are not significantly different
from zero and are comparable to those obtained for the estimator $V(\bar{y}_\pi)$. The variance
estimator $V(\bar{y}_\pi)$, however, is a much more stable variance estimator. The coefficient of
variation for the estimator $V(\bar{y}_\pi)$ is about 30%; it is 75% for $V_{r5}(\bar{y})$. As expected (Wolter
1985), an increase in the number of random groups resulted in a decrease in the coefficient
of variation of the random group variance estimator. The coefficient of variation for
$V_{r10}(\bar{y})$ was 50%. Differences among random groupings and yield samples within June
Enumerative Surveys accounted for most of the variance in the random groups variance
estimators.


5. CONCLUSIONS


Analyses show that the estimators of statewide average yield and total production currently
used by the National Agricultural Statistics Service are satisfactory. However, the simple
variance estimators $V_2(\bar{y})$ and $V_1(\hat{Y})$ were shown to have a negative bias, where the extent
of the underestimation is a function of the within-segment variance and of the within-segment
sampling rates. The estimator $V_2(\bar{y})$ underestimated the true variance of $\bar{y}$ by nearly 40%,
and $V_1(\hat{Y})$ underestimated the true variance of $\hat{Y}$ by 18% for the simulated California cotton
population.

The alternative estimators, $\bar{y}_\pi$ and $\hat{Y}_\pi$, were developed by viewing the yield sampling
scheme as a two-phase process in which segments found to contain crop acreage during phase
one (the June Enumerative Survey) are subsampled during phase two to estimate yield. The
unconditional probability of selecting a segment to receive a secondary unit within a stratum,
$\pi_{hk}$, is estimated by assuming that this probability is proportional to the conditional
probability of selecting segments at the second phase of sampling. With this assumption, the
unequal probability combined ratio estimator of the mean yield, $\bar{y}_\pi$, and the estimator of its
variance, $V(\bar{y}_\pi)$, were developed. The estimator of the total, $\hat{Y}_\pi$, is a two-phase product
estimator of the mean production per segment, where the estimator of the mean of the
auxiliary variable (crop acreage) comes from the June Enumerative Survey (phase one of
sampling). The variance estimator $V(\hat{Y}_\pi)$ is a stratified double sampling (two-phase) estimator
of the variance of $\hat{Y}_\pi$.

As shown by the Monte Carlo study, $\bar{y}_\pi$ and $\hat{Y}_\pi$ give estimates that are comparable to their
currently used counterparts, $\bar{y}$ and $\hat{Y}$. Both $V(\bar{y}_\pi)$ and $V(\hat{Y}_\pi)$ are accurate variance
estimators in samples of the size typically used by NASS. These results are due, in part, to
the precision with which average crop acreages are estimated by the June Enumerative Survey.
Precise acreage estimates produce estimates of selection probabilities that are close to the
unconditional probabilities of selection. In addition, the ratio form of the estimator reduces
the effect of replacing true unconditional probabilities with estimators.

Random group variance estimators are also essentially unbiased estimators of the variance
of estimated yield and production. However, random group estimators are much less stable
than $V(\bar{y}_\pi)$ and $V(\hat{Y}_\pi)$. Therefore, estimators $V(\bar{y}_\pi)$ and $V(\hat{Y}_\pi)$ are recommended over
random group estimators.

The June Enumerative Survey forms phase one of the objective yield survey. Sampling 
procedures for the June Enumerative Survey are straightforward and, as was shown by the 
Monte Carlo study, provide accurate acreage estimates. Hence, no change in the overall design 
for phase one of the objective yield survey is recommended. 

A number of modifications for phase two of the objective yield surveys should be
investigated. The current procedure for estimating yield is a two-phase procedure in which
a combined ratio estimator is used. In states where the sample is relatively large, independent
sampling at phase two within individual strata or for groups of strata, as well as the
use of a separate ratio estimator, should be considered.

Systematic sampling at phase two should be replaced if unbiased estimators of the variance 
are desired. Segments for yield sampling at phase two are now selected by computer at a 
national level so it should be relatively easy to change to a selection procedure with known 
joint selection probabilities. Estimators similar to those recommended for the current design 
would still be suitable if the same selection probabilities were retained. The scheme described 
by Fuller (1970) is one procedure that can be computerized, for which joint selection prob- 
abilities can be calculated, and which maintains specified selection probabilities and a degree 
of control similar to that of systematic sampling. 


Current Issues on Seasonal Adjustment


ESTELA BEE DAGUM¹


ABSTRACT 


This paper discusses three problems that have been a major preoccupation among researchers and 
practitioners of seasonal adjustment in statistical bureaus for the last ten years. These problems are: 
(1) the use of concurrent seasonal factors versus seasonal factor forecasts for current seasonal adjust- 
ment; (2) finding an optimal pattern of revisions for series seasonally adjusted with concurrent factors; 
and (3) smoothing highly irregular seasonally adjusted data. 


KEY WORDS: Concurrent vs forward seasonal factors; Revisions; Trend-cycle filters; Smoothing. 


1. INTRODUCTION 


During the last decade, within the domain of seasonal adjustment, statistical bureaus have 
focused their attention on three important issues: (1) the seasonal adjustment of a current 
value; (2) the revisions of concurrent seasonally adjusted data; and (3) the smoothing of highly 
irregular seasonally adjusted series. 

The main purpose of this article is to discuss each of the above problems with respect
to the X-11-ARIMA seasonal adjustment program developed by Dagum (1980), which
is applied by Statistics Canada and other statistical bureaus around the world.

The four modes in which the X-11-ARIMA computer package can be used to produce 
a current seasonally adjusted value are discussed in Section 2. In Section 3, the focus is on 
analysis of the revisions of concurrent seasonally adjusted data based on the linear filters 
of X-11-ARIMA. Section 4 deals with the nature and characteristics of the smoothing (trend- 
cycle) filters available in X-11-ARIMA. 


2. SEASONAL ADJUSTMENT OF CURRENT VALUES 


The seasonal adjustment of a current value can be done using either a "concurrent"
seasonal estimate or a seasonal "forecast".

A "concurrent" seasonal estimate (factor or effect depending on whether a multiplicative
or additive model is assumed) is obtained by seasonally adjusting, each time a new
observation is available, all the data available up to and including that observation. On the other
hand, a seasonal "forecast" is obtained from a series that ended in the previous year. A
common practice is to generate these seasonal forecasts, say for year $t + 1$, from data that
ended in December of the previous year $t$.

There are four modes in which the X-11-ARIMA computer program can be applied
to produce a current (last observation) seasonally adjusted value. These four modes are:
(i) using ARIMA extrapolations and concurrent seasonal factors; (ii) using ARIMA
extrapolations and seasonal factor forecasts; (iii) using concurrent seasonal factors without the
use of ARIMA extrapolations; and (iv) using seasonal factor forecasts without the use of
ARIMA extrapolations.


¹ Estela Bee Dagum, Time Series Research and Analysis Division, Methodology Branch, Statistics Canada, 13th
Floor, R.H. Coats Building, Ottawa, Ontario, Canada K1A 0T6.

While statistical bureaus use the four modes to obtain current seasonally adjusted values,
not all of them do so with the same frequency. Thus, for example, the dominant mode in
Statistics Canada is (i) followed by mode (iii), whereas in the U.S. Bureau of Labor Statistics, the
dominant mode is (ii) followed by mode (iv). The current seasonally adjusted value produced by
each type of seasonal adjustment varies and is subject to different degrees of error.

Under the assumption of an additive decomposition model, the seasonal adjustment of
a current value $X_t$ can be obtained by

$$\hat{X}_t^{(f)} = X_t - \hat{S}_t^{(f)}, \qquad (1)$$

where $\hat{S}_t^{(f)}$ denotes a forward seasonal estimate; or by

$$\hat{X}_t^{(0)} = X_t - \hat{S}_t^{(0)}, \qquad (2)$$

where $\hat{S}_t^{(0)}$ denotes a concurrent seasonal estimate.


The current seasonally adjusted value will become "final" in the sense that it will no longer
be revised after $m$ more observations are added. Thus,

$$\hat{X}_t^{(m)} = X_t - \hat{S}_t^{(m)}, \qquad (3)$$

where $\hat{S}_t^{(m)}$ denotes a final seasonal estimate.


Therefore, the total revision of a concurrent and of a forward seasonal estimate can be
written as

$$r_t^{(0,m)} = \hat{S}_t^{(0)} - \hat{S}_t^{(m)}, \qquad (4)$$

$$r_t^{(f,m)} = \hat{S}_t^{(f)} - \hat{S}_t^{(m)}. \qquad (5)$$
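
The additive adjustment and the two total revisions can be illustrated with a short sketch. The sign convention (initial estimate minus final estimate) and the numbers in the usage lines are assumptions made for illustration only.

```python
import numpy as np

def adjust(x, seasonal):
    """Additive seasonal adjustment: observed value minus seasonal estimate."""
    return np.asarray(x, dtype=float) - np.asarray(seasonal, dtype=float)

def total_revisions(s_forward, s_concurrent, s_final):
    """Total revisions of the concurrent and forward seasonal estimates (sketch)."""
    s_final = np.asarray(s_final, dtype=float)
    r_concurrent = np.asarray(s_concurrent, dtype=float) - s_final
    r_forward = np.asarray(s_forward, dtype=float) - s_final
    return r_concurrent, r_forward

# Hypothetical month: observed 105.0, seasonal estimates 4.0 (forward),
# 4.6 (concurrent) and 5.0 (final).
x_forward = adjust([105.0], [4.0])
r_concurrent, r_forward = total_revisions([4.0], [4.6], [5.0])
```

In this made-up example the concurrent estimate is revised less than the forward estimate, which is the pattern the empirical studies cited in this section usually report.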


Under the assumption of an additive decomposition and no replacement of extreme values,
$\hat{S}_t^{(m)}$, the final seasonal estimate from a series $X_{t-m}, \ldots, X_t, \ldots, X_{t+m}$, can be expressed by

$$\hat{S}_t^{(m)} = \sum_{j=-m}^{m} h_{m,j} X_{t-j} = h^{(m)}(B) X_t, \qquad (6)$$

where $h_{m,j} = h_{m,-j}$ are the symmetric moving average weights to be applied to the series.
$h^{(m)}(B)$ denotes the corresponding linear filter using the backshift operator $B$, such that
$B^n X_t = X_{t-n}$. Young (1968) showed that the length of this symmetric filter $h^{(m)}(B)$, for
monthly series, is 145 but that it can be well approximated by 85 weights because the values
of the weights attached to distant observations are very small and, thus, $m = 42$.

Following equation (6) we can express a concurrent seasonal estimate $\hat{S}_t^{(0)}$ and a forward
seasonal estimate $\hat{S}_t^{(f)}$ by:

$$\hat{S}_t^{(0)} = \sum_{j=-2m}^{0} h_{0,j} X_{t-j} = h^{(0)}(B) X_t, \quad m = 42, \qquad (7)$$

where $h^{(0)}(B)$ denotes the asymmetric concurrent seasonal filter; and

$$\hat{S}_t^{(f)} = \sum_{j=-2m}^{-\ell} h_{f,j} X_{t-j} = h^{(f)}(B) X_t, \quad m = 42, \qquad (8)$$

where $h^{(f)}(B)$ denotes the asymmetric forecasting seasonal filter and $\ell = 1, 2, \ldots, 12$ for
a monthly series.

The revision of a concurrent seasonal estimate depends on the distance between the
concurrent and the final filter, that is, $d[h^{(0)}(B), h^{(m)}(B)]$, and on the innovations of the
new observations $X_{t+1}, X_{t+2}, \ldots, X_{t+m}$.

Similarly, the revision of a forward seasonal estimate depends on $d[h^{(f)}(B), h^{(m)}(B)]$
and on the new innovations introduced by $X_{t-\ell+1}, \ldots, X_t, X_{t+1}, \ldots, X_{t+m}$.

Theoretical studies by Dagum (1982a and 1982b) have shown that

$$d[h^{(0)}(B), h^{(m)}(B)] < d[h^{(f)}(B), h^{(m)}(B)] \quad \text{for } \ell = 1, 2, \ldots, 12. \qquad (9)$$

The distance between the two filters is defined as the mean squared difference between
the frequency response functions of the filters over all the seasonal frequencies; a similar
definition is given in the next section (equation (17)) using the root mean squared difference.

Relation (9) is true whether ARIMA extrapolations are used or not. Furthermore, the two
studies also showed that

$$d[h^{(0)}(B), h^{(m)}(B)] \text{ using ARIMA extrapolations} < d[h^{(0)}(B), h^{(m)}(B)] \text{ without ARIMA extrapolations}, \qquad (10)$$

and similarly

$$d[h^{(f)}(B), h^{(m)}(B)] \text{ using ARIMA extrapolations} < d[h^{(f)}(B), h^{(m)}(B)] \text{ without ARIMA extrapolations}, \qquad (11)$$

for $\ell = 1, 2, \ldots, 12$.


Studies by Dagum (1978), Bayer and Wilcox (1981), Kenny and Durbin (1982), McKenzie
(1984), Dagum and Morry (1984), Pierce (1980) and Pierce and McKenzie (1985) have shown
that

$$r_t^{(0,m)} < r_t^{(f,m)}, \qquad (12)$$

except in a few cases where

$$r_t^{(0,m)} > r_t^{(f,m)}. \qquad (13)$$

The relationship (13) can be observed when the current observations of the latest year
are strongly revised, since $X_t$ gets the largest weight in the estimation of $\hat{S}_t^{(0)}$.

From the viewpoint of the total revisions of the seasonal estimates, the results of the above
empirical studies permit the ranking of the four modes as follows: mode (i) (ARIMA
extrapolations with concurrent seasonal estimates) gives the smallest total revision; mode (iii) (no
ARIMA extrapolations with concurrent seasonal estimates) ranks second; mode (ii) (ARIMA
extrapolations with forward seasonal estimates) ranks third; and mode (iv) (no ARIMA
extrapolations with forward seasonal estimates) ranks fourth.


3. REVISIONS OF CONCURRENT SEASONALLY ADJUSTED DATA


Statistics Canada's practice of using concurrent seasonal adjustment was first established
in 1975 for the Labour Force Survey series. Gradually, foreign statistical agencies followed
it. The use of concurrent seasonal factors for current seasonal adjustment poses the
problem of how often the series should be revised. Kenny and Durbin (1982) recommended that
revisions should be made after one month and thereafter each calendar year. Dagum (1982c)
supported these conclusions and, furthermore, recommended an additional revision at six
months if the seasonal adjustment method is X-11-ARIMA without the ARIMA extrapolation
option.

For any two points in time $t + k$, $t + \ell$ ($k < \ell$), the revision of the seasonal estimate
and consequently of the seasonally adjusted value is given by

$$r_t^{(\ell,k)} = \hat{X}_t^{(\ell)} - \hat{X}_t^{(k)}, \quad k < \ell. \qquad (14)$$

This revision reflects: (1) the innovations introduced by the new observations $X_{t+k+1}$,
$X_{t+k+2}, \ldots, X_{t+\ell}$; and (2) the differences between the two asymmetric seasonal
adjustment filters $Y^{(\ell)}(B)$ and $Y^{(k)}(B)$. If one fixes $k = 0$ and lets $\ell$ vary from 1 to $m$, then
relation (14) gives a sequence of revisions of the concurrent seasonally adjusted values for different
time spans or lags. The total revision of the concurrent estimate is given for $\ell = m$. If one
fixes $\ell = k + 1$ and lets $k$ take values from 0 to $m - 1$, then relation (14) gives the sequence
of single period revisions of each estimated seasonally adjusted value and, in particular, if
one starts at $k = 0$ one obtains the $m - 1$ successive single period revisions of each estimated
seasonally adjusted value before it becomes final. If one fixes $\ell = k + 12$ and lets $k$ take
values from 0 to $m - 12$, then equation (14) gives the sequence of annual revisions.
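
The three revision sequences described above can be sketched for a single month whose successive estimates are stored by the number of observations added since it was first adjusted. The function names, the array layout, and the sign convention (later estimate minus earlier estimate) are assumptions of this sketch.

```python
import numpy as np

def revision(estimates, l, k):
    """Change in the estimate for a fixed month between k and l added observations."""
    if not k < l:
        raise ValueError("requires k < l")
    return float(estimates[l] - estimates[k])

def revision_sequences(estimates):
    """estimates[n] is the seasonally adjusted value after n more observations, n = 0..m."""
    est = np.asarray(estimates, dtype=float)
    m = est.size - 1
    return {
        # k fixed at 0, l = 1..m; the last entry is the total revision
        "from_concurrent": [revision(est, l, 0) for l in range(1, m + 1)],
        # l = k + 1, k = 0..m-1
        "single_period": [revision(est, k + 1, k) for k in range(0, m)],
        # l = k + 12, k = 0..m-12
        "annual": [revision(est, k + 12, k) for k in range(0, m - 11)],
    }

# Hypothetical vintages for one month with m = 42.
sequences = revision_sequences(np.linspace(100.0, 101.0, 43))
```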

The revisions in which we are interested here are those introduced by filter discrepancies,
and these can be studied by looking at the frequency response functions of the corresponding
filters. Similarly to equation (6), we can approximate the seasonally adjusted value for recent
years from the X-11-ARIMA program (with or without ARIMA extrapolations) by

$$\hat{X}_t^{(n)} = \sum_{j=-n}^{m} Y_{n,j} X_{t-j}. \qquad (15)$$

Equation (15) represents a linear system where $\hat{X}_t^{(n)}$ is the convolution of the input $X_t$
and a sequence of weights $Y_{n,j}$ called the impulse response function of the filter. The
properties of this function can be studied using its Fourier transform, which is called the frequency
response function, defined by

$$\Gamma^{(n)}(\omega) = \sum_{j=-n}^{m} Y_{n,j} e^{-i 2\pi \omega j}, \quad -\tfrac{1}{2} \le \omega \le \tfrac{1}{2}, \qquad (16)$$

where $\omega$ is the frequency in cycles per unit time. $\Gamma^{(n)}(\omega)$ fully describes the effects of the
linear filter on the given input. Monthly and annual revisions of the concurrent filter of
X-11-ARIMA with and without the ARIMA extrapolations have been calculated by Dagum
(1987) based on the mathematical distance between the various frequency response functions
of the filters. The pattern is characterized by a rapid decrease in the size of the monthly
revisions of the concurrent filter for $\ell = 1, 2,$ and 3; and a slow decrease thereafter until $\ell = 11$;
then a large increase occurs at $\ell = 12$ followed by a decrease at $\ell = 13$ and then another
large increase at $\ell = 24$ followed by a decrease at $\ell = 25$. Dagum (1987) showed that this
pattern of monthly revisions is the same whether ARIMA extrapolations are used or not.
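
A frequency response function of the kind defined in (16) can be evaluated numerically for any finite set of filter weights. The sketch below uses a simple 3-term moving average as a stand-in, since the actual X-11-ARIMA weights are not reproduced here; the function names are illustrative.

```python
import numpy as np

def frequency_response(weights, lags, omega):
    """Sum over j of w_j * exp(-i 2 pi omega j), with omega in cycles per unit time."""
    weights = np.asarray(weights, dtype=float)
    lags = np.asarray(lags, dtype=float)
    omega = np.atleast_1d(np.asarray(omega, dtype=float))
    return np.exp(-2j * np.pi * np.outer(omega, lags)) @ weights

def gain(weights, lags, omega):
    """Modulus of the frequency response: how much each frequency is amplified or damped."""
    return np.abs(frequency_response(weights, lags, omega))

# Stand-in filter: centred 3-term moving average on lags -1, 0, 1.
g = gain([1 / 3, 1 / 3, 1 / 3], [-1, 0, 1], [0.0, 1 / 12, 1 / 3])
```

For this stand-in filter the gain is 1 at frequency zero and 0 at one-third of a cycle per period, so a constant passes unchanged while a 3-period cycle is removed.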

The significant decreases for the first three consecutive revisions are due to the
improvement of the Henderson (trend-cycle) filter weights. The reversal of direction in the size of
the filter revisions at $\ell = 12$ and $\ell = 13$ is due to the improvements of the seasonal filter
that becomes less asymmetrical from year to year until three full years are added to the series.
The two largest revisions occur at $\ell = 1$ and $\ell = 12$. Given the non-monotonicity of single
monthly revisions, it is not advisable to revise the concurrent estimate any time a new
observation is added to the series.

A revision scheme often used by statistical bureaus for their concurrent seasonally
adjusted series consists of keeping constant the concurrent estimate from the time it appears
until the end of the year and then revising annually the current and earlier years. Therefore,
first-year revisions due to filter discrepancies are given by $R^{(1,0)}, R^{(2,0)}, \ldots, R^{(11,0)}$,
second-year revisions by $R^{(12,0)}, R^{(13,1)}, \ldots, R^{(23,11)}$, third-year revisions by $R^{(24,12)}, R^{(25,13)}, \ldots$
and so on, where $R^{(\ell,k)}$ is defined by

$$R^{(\ell,k)} = \left[ 2 \int_{0}^{1/2} \left| \Gamma^{(\ell)}(\omega) - \Gamma^{(k)}(\omega) \right|^2 d\omega \right]^{1/2}, \quad k < \ell, \qquad (17)$$

and n = 42 for the X-11-ARIMA seasonal adjustment filters.
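
A root mean squared distance between two frequency response functions can be computed numerically as sketched below. The integration range from 0 to one-half and the factor 2 are assumptions of this sketch, and the two short filters in the usage lines are stand-ins rather than X-11-ARIMA filters.

```python
import numpy as np

def frequency_response(weights, lags, omega):
    """Sum over j of w_j * exp(-i 2 pi omega j), with omega in cycles per unit time."""
    weights = np.asarray(weights, dtype=float)
    lags = np.asarray(lags, dtype=float)
    omega = np.atleast_1d(np.asarray(omega, dtype=float))
    return np.exp(-2j * np.pi * np.outer(omega, lags)) @ weights

def filter_distance(weights_a, weights_b, lags, n_grid=2001):
    """sqrt(2 * integral over [0, 1/2] of |Gamma_a - Gamma_b|^2), by the trapezoid rule."""
    omega = np.linspace(0.0, 0.5, n_grid)
    diff = frequency_response(weights_a, lags, omega) - frequency_response(weights_b, lags, omega)
    f = np.abs(diff) ** 2
    step = omega[1] - omega[0]
    integral = step * (f.sum() - 0.5 * (f[0] + f[-1]))
    return float(np.sqrt(2.0 * integral))

# Stand-in filters on lags -1, 0, 1: a symmetric average and a one-sided average.
d = filter_distance([1 / 3, 1 / 3, 1 / 3], [0.0, 0.5, 0.5], [-1, 0, 1])
```

For real weights, Parseval's relation implies that this distance equals the square root of the sum of squared weight differences, which gives a quick check on the numerical integration.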


Table 1 shows the first-, second- and third-year revisions of the concurrent seasonal
adjustment filter for X-11-ARIMA without extrapolation and with extrapolations
from one ARIMA model and two sets of parameter values (other cases are shown in
Dagum 1987). The ARIMA model chosen is the classical $(0,1,1)(0,1,1)_{12}$ model, that is,
$(1 - B)(1 - B^{12}) X_t = (1 - \theta B)(1 - \Theta B^{12}) a_t$, where $X_t$ denotes the original series, $B$
is the backshift operator such that $B^n X_t = X_{t-n}$, $a_t$ is a purely random process that
represents the innovations, and $\theta$ and $\Theta$ are the non-seasonal and seasonal parameters,
respectively.
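
A series following this model can be simulated by building the moving average part from white noise and then undoing the two differencing operators. This is a minimal sketch: the function name, the zero start-up values, and the unit innovation variance are assumptions.

```python
import numpy as np

def simulate_airline(n, theta, Theta, rng, s=12):
    """Simulate n values from (1 - B)(1 - B^s) X_t = (1 - theta B)(1 - Theta B^s) a_t."""
    a = rng.standard_normal(n + s + 1)  # innovations with unit variance (assumption)
    # Moving average part: w_t = a_t - theta a_{t-1} - Theta a_{t-s} + theta Theta a_{t-s-1}
    w = a[s + 1:] - theta * a[s:-1] - Theta * a[1:-s] + theta * Theta * a[:-s - 1]
    # Undo the differencing: X_t = X_{t-1} + X_{t-s} - X_{t-s-1} + w_t, started from zeros
    x = np.zeros(n + s + 1)
    for t in range(s + 1, n + s + 1):
        x[t] = x[t - 1] + x[t - s] - x[t - s - 1] + w[t - s - 1]
    return x[s + 1:], w

rng = np.random.default_rng(0)
series, ma_part = simulate_airline(120, theta=0.40, Theta=0.80, rng=rng)
```

The two parameter values in the usage line match one of the parameter sets quoted for Table 1; any values inside the invertibility region could be used instead.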

Since the largest single period revisions occur at $\ell = 1$ and $\ell = 12$ as mentioned above,
a better revision scheme would be to incorporate monthly and annual revisions. It is expected
that (1) adjusting concurrently each month, say from January to November and revising
only once when the next month is available, and (2) adjusting concurrently December when
it first appears and then revising the first year and earlier years when January is added, should
improve the reliability of the filter applied during the current year while maintaining
simultaneously the filter's homogeneity for month-to-month comparisons.

The first-year revisions of the first-month revised filter would then be $R^{(2,1)}, R^{(3,1)},
\ldots, R^{(11,1)}$. Table 2 shows these revisions and, although the pattern is very similar to
that of the concurrent filter, the size of the revisions is much smaller if no extrapolations
are used. On the other hand, the improvement is less important if ARIMA extrapolations
are used. Similarly, no major differences were observed for the second- and third-year
revisions.


3.1 Estimation of Trading Day Variations and ARIMA Models with Concurrent 
Seasonal Adjustment 


Besides the type of revisions scheme to be applied, there are two other problems posed 
by concurrent seasonal adjustment associated with trading day variations and ARIMA 
modelling. 

Table 1

First-, Second- and Third-Year Revisions of the Concurrent
Seasonal Adjustment Filter of X-11-ARIMA

                                   With ARIMA Extrapolations
Revisions      Without ARIMA       from a $(0,1,1)(0,1,1)_{12}$ Model
$R^{(\ell,k)}$     Extrapolations      θ=.40 Θ=.80     θ=.80 Θ=.80
RR“) a el .06 
R 9) “13 me .08 
R9 Ais) 13 .08 
R “9 13 13 09 
R&>9) 415 13 09 
Ro” AF 13 09 
R“9) 16 ‘3 09 
R‘% 16 mG) 09 
R?9) 16 13 09 
Reo 16 14 09 
Rae 16 14 09 
R29) 29 28 26 
RGA) Qa 2207 26 
RY pA] 27 26 
R31) 7 26 26 
Roe 20 16 16 
Ree 18 AT, 16 
R624) 16 “il 7 16 

Table 2

First-Year Revisions of the First-Month Revised
Seasonal Adjustment Filter

                                   With ARIMA Extrapolations
Revisions      Without ARIMA       from a $(0,1,1)(0,1,1)_{12}$ Model
$R^{(\ell,k)}$     Extrapolations      θ=.40 Θ=.80     θ=.80 Θ=.80

R&) .07 10 .06 
Ro) ‘07 .10 .06 
Re .07 10 .07 
ROY .08 .10 .08 
ROD 10 ll .08 
RW) oh ail .08 
R®) at silt .08 
RV?) abil mF .08 
Ri1) a ag) .08 


Re 12 19 .08 

For series which are flows in the sense that they result from the accumulation of daily
values over the calendar months, there is a systematic effect caused by trading day varia- 
tions. Trading day variations arise mainly because the activity varies with the days of the 
week. Other sources are associated with accounting and reporting practices. For example, 
stores that do their bookkeeping activities on Friday tend to report higher sales in months 
with five Fridays than in months with four Fridays. The trading day effects are estimated 
in the X-11-ARIMA program using ordinary least squares on a simple deterministic regres- 
sion model. Consequently, the weights estimated for each day change any time a new obser- 
vation is added to the series. Since regression techniques are very sensitive to outliers, these 
changes can be sometimes unnecessarily large. 
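
A trading day regression of this general kind can be sketched with ordinary least squares on counts of each weekday per month. The contrast coding (each weekday count minus the Sunday count), the function names, and the synthetic data are assumptions of this sketch and are not taken from the X-11-ARIMA program.

```python
import calendar
import numpy as np

def weekday_counts(year, month):
    """Number of Mondays, ..., Sundays in a given month (Monday is index 0)."""
    first_weekday, n_days = calendar.monthrange(year, month)
    counts = np.zeros(7)
    for d in range(n_days):
        counts[(first_weekday + d) % 7] += 1
    return counts

def trading_day_design(months):
    """Six contrast regressors per month: weekday count minus Sunday count."""
    counts = np.array([weekday_counts(y, m) for y, m in months])
    return counts[:, :6] - counts[:, [6]]

def estimate_daily_weights(irregular, months):
    """Ordinary least squares fit of the trading day contrasts (sketch)."""
    X = trading_day_design(months)
    beta, *_ = np.linalg.lstsq(X, np.asarray(irregular, dtype=float), rcond=None)
    return beta

# Synthetic check: recover known weights from noise-free data over four years.
months = [(y, m) for y in range(2000, 2004) for m in range(1, 13)]
true_beta = np.array([0.5, 0.2, 0.1, 0.0, 0.8, -0.6])
estimated = estimate_daily_weights(trading_day_design(months) @ true_beta, months)
```

Because the fit is refreshed whenever a month is appended, a single outlying month can move all six coefficients, which is the sensitivity noted above.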

When the series are seasonally adjusted concurrently, the trading day estimates change all 
the time. In order to avoid unnecessary revisions, Statistics Canada’s practice is to use the weights 
calculated by the program at the end of the previous calendar year or the weights provided by 
the users, as priors for the current year. The weights are then revised on an annual basis. 

The effect of trading day variations must be removed from the series before ARIMA
modelling, for this type of model cannot adequately handle trading day variations. In other
words, if the X-11-ARIMA program is used with ARIMA extrapolations on series with trading
day variations, these variations should be estimated a priori and, if significant, they should
be removed from the original series before the ARIMA modelling.

Another problem associated with concurrent seasonal adjustment is how often the
ARIMA models should be identified. The current practice at Statistics Canada is to use the
automatic ARIMA model selection option once a year and, if the model is accepted, to keep
it constant for a whole year, letting only the parameters change when more observations
are added. In order to keep the model constant, the user-supplied model option should
be applied. Keeping the ARIMA model constant avoids unnecessary revisions that may
result from switching models back and forth simply because of the presence of outliers.


4. SMOOTHING OF VOLATILE SEASONALLY ADJUSTED SERIES 


One of the main purposes of the seasonal adjustment of economic time series is to provide 
information on current economic conditions, particularly to determine the stage of the cycle 
at which the economy stands. Since seasonal adjustment means removing only seasonal 
variations, thus leaving trend-cycle variations together with irregular fluctuations, it is often 
difficult to detect the short-term trend or cyclical turning points for series strongly affected 
with irregulars. In such cases, it may be preferable to smooth the seasonally adjusted series 
using trend-cycle estimators which suppress as much as possible the irregulars without af- 
fecting the cyclical component. 

The use of trend-cycle values has been discussed by several writers and recently by Moore 
et al (1981), Kenny and Durbin (1982), Maravall (1986) and Dagum and Laniel (1987). 
Although not yet practised widely, some statistical agencies such as Statistics Canada and 
the Australian Bureau of Statistics smooth some of their seasonally adjusted series, particu- 
larly those series that are strongly affected by irregulars. 

The combined linear filters applied to the original series to generate a central (symmetric) 
estimate of the trend-cycle component have been calculated by Young (1968) for Census 
Method II-X-11 variant. This filter is similar to that of X-11-ARIMA with and without 
ARIMA extrapolations. Dagum and Laniel (1987) extended Young’s (1968) results to in- 
clude the estimation of the asymmetric trend-cycle filters of X-11-ARIMA with and without 
the ARIMA extrapolations. 

Figure 1 shows the gain functions of the central (symmetric) seasonal adjustment filters
and smoothed seasonally adjusted data (trend-cycle) filters. It is apparent that the
trend-cycle filters suppress all the noise present in the series, where the noise is defined as the power
present in all frequencies $\omega > .166$. This frequency corresponds to the first harmonic of the
fundamental seasonal frequency of a monthly series. This pattern results from the
convolution of the seasonal adjustment filters with the 13-term Henderson trend-cycle filter.

Figure 2a shows the gain functions of the concurrent and first-month revised trend-cycle 
filters of X-11-ARIMA without ARIMA extrapolations. Figure 2b shows their corresponding 
phase-shift functions expressed in months instead of radians. We can observe that the gain 
for all ω ≥ .166 is much larger for these two asymmetric filters than for the central 
filter. Furthermore, there are large amplifications for frequencies near the fundamental 
seasonal. All this means that the concurrent and first revised smoothed seasonally adjusted 
values will have more noise than the final estimates. On the other hand, it is apparent that 
the phase shifts are very small, less than one month for the most important cyclical frequencies 
0 < ω < .055 (i.e., cycles with periodicities of 18 months or longer). 
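
The gain and phase-shift functions discussed above can be computed for any linear filter directly from its weights. The following is a minimal Python sketch of that calculation, not the X-11-ARIMA filters themselves: it uses a centred 2x12 moving average and a pure one-month lag as stand-in filters, and all function and variable names are hypothetical.

```python
import cmath
import math

def frequency_response(weights, lags, freq):
    """Response of y_t = sum_k w_k * x_{t+lag_k} at `freq` (cycles per month)."""
    omega = 2.0 * math.pi * freq
    return sum(w * cmath.exp(1j * omega * k) for w, k in zip(weights, lags))

def gain(weights, lags, freq):
    """Factor by which the filter scales a sinusoid of frequency `freq`."""
    return abs(frequency_response(weights, lags, freq))

def delay_in_months(weights, lags, freq):
    """Phase shift in months (freq must be > 0); positive means the output lags the input."""
    omega = 2.0 * math.pi * freq
    return -cmath.phase(frequency_response(weights, lags, freq)) / omega

# Stand-in symmetric filter (an assumption): centred 2x12 moving average.
ma_lags = list(range(-6, 7))
ma_weights = [1 / 24] + [1 / 12] * 11 + [1 / 24]

# Stand-in asymmetric filter (an assumption): y_t = x_{t-1}.
lag_weights, lag_lags = [1.0], [-1]
```

A symmetric filter has zero phase shift at frequencies where its response is positive, which is why only the asymmetric (concurrent and revised) filters need a phase-shift plot.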


[Figure 1: gain plotted against frequency. Solid line: seasonal adjustment filter; dashed line: trend-cycle filter.] 

Figure 1. Gain Functions of the Central (Symmetric) Trend-Cycle and Seasonal Adjustment Filters 
of X-11-ARIMA. 

Figure 2a. Gain Functions of the Concurrent and First-Month Revised Filters of X-11-ARIMA without 
ARIMA Extrapolations. 

Figure 2b. Phase-Shift Functions of the Concurrent and First-Month Revised Filters of X-11-ARIMA 
without ARIMA Extrapolations. 

Figures 3a and 3b show the gain and phase-shift functions of the concurrent and first-month 
revised trend-cycle filters of X-11-ARIMA with ARIMA extrapolations. The extrapolations 
are obtained from an IMA model (0,1,1)(0,1,1)₁₂ with θ = .40 and Θ = .60. The gain 
functions are closer to the symmetric (central) filter than those of X-11-ARIMA without 
the ARIMA extrapolations. There are no amplifications around the fundamental seasonal 
frequency and a similar attenuation of power at higher frequencies. On the other hand, there 
is more phase-shift (being near to one month) for low frequencies and less phase-shift for 
all high frequencies. 

Dagum and Laniel (1987) studied the time path of the revisions of the trend-cycle filters 
and compared them with those of the seasonal adjustment filters. Their results, as summarized 
in Table 3, show that the total revisions of the trend-cycle asymmetric filters converge to 
zero much faster than those of the corresponding seasonal adjustment filters. In fact, the 
total revision of the trend-cycle filter three months after the concurrent filter is only .1, whereas 
a close value is achieved for the seasonal adjustment filter only after 24 months have been 
added to the series. Except for the total revision of the concurrent filter, which is larger 
for the trend-cycle filters than for the corresponding seasonal adjustment filters, the total 
revisions are in all other cases smaller for the trend-cycle filters. Furthermore, the 
trend-cycle filter revisions converge to zero much faster than those of the seasonal 
adjustment filters. 


Table 3 

Time Path of the Total Revisions of the Trend-Cycle and the Seasonal Adjustment 
Asymmetric Filters of X-11-ARIMA 

             Without Extrapolations        With Extrapolations 
Revisions    Trend-Cycle   Seasonal        Trend-Cycle   Seasonal 
             Filter        Adjustment      Filter        Adjustment 
                           Filter                        Filter 
R80) 45 .36 41 532 
REY P| £33 .26 .32 
Ro aS 32 o15 uy. 
Ro) aft 32 sal 31 
Ree) nip 32 1’ 31 
Re .10 .23 .09 .20 
REA .07 13 .05 10 
REO .03 .05 .02 04 
Re) 01 01 01 01 


* ? = 48 for the ‘‘final’’ trend-cycle filter and ? = 42 for the final seasonal adjustment filter. However, the values 
shown for the revision of the seasonal adjustment filters are also calculated for ? = 48 since after ? = 42 the 
values are final and, thus, do not change. 

Figure 3a. Gain Functions of the Concurrent and First-Month Revised Filters of X-11-ARIMA with 
ARIMA Extrapolations (θ = .40, Θ = .60). 

Figure 3b. Phase-Shift Functions of the Concurrent and First-Month Revised Filters of X-11-ARIMA 
with ARIMA Extrapolations (θ = .40, Θ = .60). 


On Efficient Estimation of Unemployment Rates from 
Labour Force Survey Data 


S. KUMAR and A.C. SINGH¹ 


ABSTRACT 


The method of minimum $Q^{(T)}$ estimation for complex survey designs proposed by Singh (1985) 
provides asymptotically efficient estimates of model parameters analogous to Neyman's (1949) min $X^2$ 
estimation procedure for simple random samples. The $Q^{(T)}$ can be viewed as a $X^2$ type statistic for 
categorical survey data, and min $Q^{(T)}$ estimates provide a robust alternative to Weighted Least Squares 
estimates, which often display unstable behaviour for complex surveys. In this paper, the min $Q^{(T)}$ 
method is first described and then illustrated for the problem of estimating parameters of a logit model 
for survey estimates of unemployment rates which are obtained from the October 1980 Canadian LFS 
data cross-classified according to age-education covariate categories. It is seen that the trace efficiency 
of smoothed estimates obtained by Kumar and Rao (1986), who applied the method of pseudo maximum 
likelihood estimates (pseudo mle) to the same problem, can be slightly improved by the min $Q^{(T)}$ 
method. Interestingly enough, pseudo mle for individual cells behave much the same way as the 
efficient min $Q^{(T)}$ estimates for the particular LFS example. 


KEY WORDS: Pseudo mle; WLS estimator; Min $Q^{(T)}$ estimator; Asymptotic efficiency; 
Approximate likelihood; Generalized score statistic. 


1. INTRODUCTION 


Based on October 1980 Labour Force Survey (LFS) data, Kumar and Rao (1984, 1986) 
proposed and analysed a logistic regression (logit) model for unemployment rates. They 
used the theory developed by Roberts (1985) and Roberts, Rao and Kumar (1987), who 
generalized the Rao-Scott method (1981, 1984) of adjusting $X^2$ for impact of the underlying survey 
design to test the fit of the logit model. Kumar and Rao considered unemployment rates 
in various cells (or domains) that had been obtained by cross-classifying the population into 
a number of age and education categories. The logit model consisted of both linear and 
quadratic effects for the age variable, with only the linear effect for the education variable. 
The same LFS data were also analysed by Singh and Kumar (1986) using an alternative 
method, namely the $Q^{(T)}$ test proposed by Singh (1985). The test $Q^{(T)}$ is a $X^2$ type test 
based on a generalized score statistic of principal components. Results obtained by the $Q^{(T)}$ 
method were found to be in agreement with those arrived at by the adjusted $X^2$ method. 

Whenever a suitable model is determined, it is of interest to find good estimates of model 
parameters. These, in turn, provide fairly good estimates of true rates for domains. Such 
estimates (often called ‘‘smoothed estimates’’) are especially useful for domains in which 
survey estimates lack precision because the number of observations is not sufficient. It may 
be noted that since smoothed estimates are obtained after a model is found to have a reasonable 
fit, the bias in the estimates is expected to be negligible. Kumar and Rao (1986) used the 
method of pseudo mle (pseudo maximum likelihood estimates) under the working form of 
the likelihood that corresponds to independent binomial samples for estimating parameters 
of a logit model after an adequate fit had been established for the October 1980 LFS data. 
They found a considerable gain in efficiency over survey estimates of unemployment rates 
in the particular LFS example. 

Pseudo mle are known to be useful when the likelihood function is not available or when 
it is difficult to compute due to complexities of the survey design. Under suitable regularity 
conditions, the pseudo mle provide consistent and asymptotically normal estimates (Imrey, 
Koch and Stokes 1982). In this paper we consider the problem of finding asymptotically ef- 
ficient (in a sense to be explained in Section 3) estimates of model parameters and therefore 
of domain estimates. We describe the min $Q^{(T)}$ estimator, proposed in Singh (1985), based 
on the generalized scores approach which can be viewed as analogous to Neyman's min $X^2$ 
estimator for simple random samples. It may be noted that the WLS (Weighted Least Squares) 
approach for complex survey designs (Koch, Freeman and Freeman 1975) also provides 
asymptotically efficient estimates. However, these estimates are usually unstable for moderate sample 
sizes due to near singularity of the estimated covariance matrix of survey cell estimates (see 
Imrey, Koch and Stokes 1982, Fay 1985). The min $Q^{(T)}$ estimates, on the other hand, are 
designed to guard against the instability problem mentioned above. It will be seen that the 
problem of instability can be overcome by the min $Q^{(T)}$ method by employing a modified 
version of the estimated covariance matrix in which the relatively very small eigenvalues from 
its spectral decomposition are trimmed. 

The necessary notation along with a brief review of the test $Q^{(T)}$ are presented in 
Section 2. Next the min $Q^{(T)}$ estimator and its asymptotic behaviour are described in Section 3. 
The example using LFS data is given in Section 4 as an illustration. For this numerical 
example, an interesting finding was that over individual cells, the pseudo mle perform almost at 
par with efficient min $Q^{(T)}$ estimates. In terms of an overall measure as given by trace 
efficiency, pseudo mle are found to be only slightly inferior to min $Q^{(T)}$ estimates. Finally, 
Section 5 contains some concluding remarks. 


2. THE TEST $Q^{(T)}$: A BRIEF REVIEW 


We shall briefly describe the test $Q^{(T)}$ in order to motivate the min $Q^{(T)}$ method of 
estimation (for more details, see Singh 1985, Singh and Kumar 1986). Let $I$ denote the number 
of disjoint domains and $\nu_i$ denote the parameter of interest for the $i$-th domain. Consider 
a model for $\nu = (\nu_1, \nu_2, \ldots, \nu_I)'$ as 

$$H_0: h(\nu) = X\theta \tag{2.1}$$

where $X$ is a known $I \times r$ matrix of full rank $r$, $\theta$ is an $r$-vector of unknown parameters, 
and $h$ is a continuously differentiable one-to-one function, for instance, log or logit. 

Let $\hat\nu$ denote the $I$-vector of survey estimates. Assume that under a suitable central limit 
theorem 

$$\hat\nu \sim MVN(\nu, \Gamma/n) \tag{2.2}$$

where "$\sim$" means "asymptotically distributed as", $n$ is the total sample size, and $\Gamma$ is the 
asymptotic covariance matrix of $\sqrt{n}(\hat\nu - \nu)$. 

Now, choose a small level $\epsilon$ ($> 0$) of dimensionality reduction (e.g., .01 or .005 can be taken 
as working values of $\epsilon$). Find a number $T$ such that with the eigenvalues $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_I$ 
of the estimated covariance matrix $\hat\Gamma$, we have 

$$T = \max\Big\{ t : t > r \ \text{and} \ \sum_{i=t}^{I} \lambda_i \Big/ \sum_{i=1}^{I} \lambda_i \ge \epsilon \Big\} \tag{2.3}$$

The variable $T$, although random, can be regarded as fixed for our asymptotics. It may be 
noted that if there are no relatively very small eigenvalues (i.e. if $\hat\Gamma$ is not ill-conditioned), 
then there will usually be no effect of dimensionality reduction for small $\epsilon$ and $T$ will 
coincide with $I$ in those situations. 
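
The selection rule for $T$ can be illustrated with a short Python sketch. It assumes the eigenvalues of the estimated covariance matrix have already been computed, reads the rule as "the largest $t > r$ whose tail sum $\lambda_t + \ldots + \lambda_I$ is at least a fraction $\epsilon$ of the trace", and uses a hypothetical function name and made-up eigenvalues.

```python
def choose_T(eigenvalues, r, eps):
    """Largest t > r such that (lambda_t + ... + lambda_I) / trace >= eps.

    Returns None if no t > r satisfies the condition.
    """
    lam = sorted(eigenvalues, reverse=True)   # lambda_1 >= ... >= lambda_I
    total = sum(lam)
    best = None
    for t in range(r + 1, len(lam) + 1):      # candidates t = r+1, ..., I
        tail = sum(lam[t - 1:])               # lambda_t + ... + lambda_I
        if tail / total >= eps:
            best = t                          # tail shrinks with t, so keep the last hit
    return best

# Made-up eigenvalues with two relatively very small ones (trace = 10).
example = [5.0, 3.0, 1.5, 0.45, 0.04, 0.01]
```

With these illustrative values, a larger $\epsilon$ trims more of the small eigenvalues, while a sufficiently small $\epsilon$ leaves $T$ equal to the number of domains.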

Consider the problem of testing $H_0$ against alternatives $K_0: h(\nu) \ne X\theta$ in the class of 
tests based on the first $T$ principal components $W$ of $\hat\nu$. Let the normalized eigenvector 
corresponding to $\lambda_i$ be $P_i$ (it need not be unique) and let $M_T$ denote the $I \times T$ matrix of 
eigenvectors $P_i$ corresponding to the $T$ largest eigenvalues. Then 

$$W = M_T'\hat\nu \sim MVN(\mu, D_T/n), \tag{2.4}$$

where $\mu = M_T'\nu$ and $D_T = \mathrm{diag}(\lambda_1, \ldots, \lambda_T)$. 

Based on $W$, the original testing problem concerning an $I$-dimensional $\nu$ is reduced to 
testing a hypothesis about the $T$-dimensional parameter $\mu$ given by 

$$H_1: \mu = M_T' h^{-1}(X\theta) \quad \text{vs} \quad K_1: \mu \ne M_T' h^{-1}(X\theta). \tag{2.5}$$


The test statistic $Q^{(T)}$ can be obtained as a score statistic of principal components by 
employing the approximate likelihood of $\theta$ given by the limiting distribution (2.4) of $W$ for 
computing the efficient scores (see Cox and Hinkley 1974, p. 321-324). We shall refer to 
$Q^{(T)}$ as a generalized score test that would reject $H_0$ for large values of the quadratic form 

$$Q^{(T)}(\theta^0) = Y(\theta^0)' A_T Y(\theta^0) - Z_T(\theta^0)' \Lambda_T Z_T(\theta^0) \sim X^2_{T-r} \tag{2.6}$$

where 

$$Y(\theta) = \hat\nu - \nu(\theta), \qquad A_T = \sum_{i=1}^{T} (P_i P_i'/\lambda_i),$$

$$Z_T(\theta^0) = B'A_T Y(\theta^0), \qquad B = (\partial\nu/\partial\theta), \qquad \Lambda_T = (B'A_T B)^{-1},$$

and $\theta^0$ is some fixed point in the null parameter space. In computing $Q^{(T)}$, any 
root-$n$ consistent estimate of $\theta$ under $H_0$ can be substituted for $\theta^0$, such as the pseudo mle of $\theta$. Notice 
that $Q^{(T)}$ of (2.6) is in fact a quadratic form in $W$ but is expressed in $\hat\nu$ for the sake of 
convenience. 

For testing $H_0$ vs $K_0$ in the class of tests based on $W$, the asymptotic optimality of the 
test $Q^{(T)}$ follows from that of the score statistic. For small $\epsilon > 0$, $\hat\nu$ and $W$ will be close in 
the sense that principal components provide the best possible way of dimensionality reduction 
with a minimum loss of information. Thus $Q^{(T)}$ (for small $\epsilon$) is expected to be robust with 
respect to the test $Q$ corresponding to no dimensionality reduction. However, $Q$ may be 
unstable (in the sense of inflated Type I error rate) for finite samples due to possible near 
singularity of $\hat\Gamma$. The test $Q^{(T)}$ is expected to control this problem of instability at the cost 
of sacrificing some information in the data that gives rise to possibly unreliable components 
in $Q$ in the directions of eigenvectors that correspond to relatively very small eigenvalues. 
The loss of information implies that the test $Q^{(T)}$ will lack power for alternatives in 
directions of (near) singularities. However, this loss of power is offset by the gain in control of 
Type I error rate. The instability control is further ensured by the fact that, since $H_0$ is a 
subset of $H_1$, $Q^{(T)}$ will be a conservative test for $H_0$. 

A special asymptotically equivalent version of $Q^{(T)}(\theta^0)$, which has a simpler expression 
similar to that of the standard Pearson-Fisher $X^2$, is obtained by replacing $\theta^0$ with an 
estimator $\tilde\theta$ that minimizes the expression $(\hat\nu - \nu(\theta))' A_T (\hat\nu - \nu(\theta))$. We then have 

$$Q^{(T)}(\tilde\theta) = Y(\tilde\theta)' A_T Y(\tilde\theta) = \sum_{i=1}^{T} \{P_i'(\hat\nu - \nu(\tilde\theta))\}^2/\lambda_i \sim X^2_{T-r} \tag{2.7}$$

Henceforth we assume that, for a given data vector $\hat\nu$, a model $H_0$ has been deemed 
appropriate based on the test $Q^{(T)}$ or some other test such as the adjusted $X^2$ test. In the 
next section, we give an asymptotically efficient method of estimating parameters $\theta$ under 
$H_0$, using the statistic $Q^{(T)}$. The $\theta$ estimates in turn provide a set of smoothed estimates 
of $\nu$ corresponding to survey estimates $\hat\nu$. 
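
As an illustration of the simpler form (2.7), the sketch below evaluates the sum of squared principal-component residuals for given eigenpairs. It is a minimal sketch under stated assumptions: the eigenvalues and normalized eigenvectors are taken as already computed from the estimated covariance matrix of the survey estimates, any scaling by the sample size is assumed to be absorbed in the eigenvalues, and the function name and numbers are hypothetical.

```python
def q_statistic(residual, eigvals, eigvecs, T):
    """Sum over the T largest eigenvalues of (P_i' residual)^2 / lambda_i.

    residual : survey estimates minus fitted values, one entry per domain
    eigvals  : eigenvalues of the estimated covariance matrix
    eigvecs  : matching normalized eigenvectors (one list per eigenvalue)
    """
    pairs = sorted(zip(eigvals, eigvecs), key=lambda p: p[0], reverse=True)[:T]
    total = 0.0
    for lam, vec in pairs:
        component = sum(p * y for p, y in zip(vec, residual))  # P_i' residual
        total += component ** 2 / lam
    return total

# Toy example: diagonal covariance, so the eigenvectors are unit vectors.
toy_vals = [4.0, 1.0, 0.25]
toy_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_resid = [2.0, 1.0, 0.5]
```

Dropping the smallest eigenvalue removes the component that divides by the least reliable quantity, which is the mechanism behind the stability argument above.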


3. THE MIN $Q^{(T)}$ ESTIMATOR 


Consider the approximate likelihood for the mean $\mu$ of the first $T$ principal components 
$W$ of $\hat\nu$, given earlier by (2.4). Suppose the model $H_0: h(\nu) = X\theta$ is accepted. Then, the 
kernel function $K(\theta)$ of the approximate likelihood for $\mu(\theta)$ is given by 

$$K(\theta) = (W - \mu(\theta))' D_T^{-1} (W - \mu(\theta)) = (\hat\nu - \nu(\theta))' A_T (\hat\nu - \nu(\theta)) \tag{3.1}$$

The value $\tilde\theta$ that minimizes $K(\theta)$ corresponds to the mle of $\theta$ for the approximate likelihood 
of $\mu$ under $H_0$. The estimator $\tilde\theta$ will be asymptotically efficient (or best asymptotically 
normal (BAN) in the sense of Neyman, 1949), in a restricted class, namely in the class of estimates 
based on $W$. Following the min $X^2$ estimator of Neyman (1949), the estimator $\tilde\theta$ was termed 
the min $Q^{(T)}$ estimator in Singh (1985). Notice that the estimator $\tilde\theta$ depends on the level $\epsilon$ of 
dimensionality reduction via $A_T$. Thus $\tilde\theta$ varies if $\epsilon$ does. 

The smoothed estimates of $\nu$ under $H_0$ based on $W$ can be obtained as follows. Find $\tilde\theta$ 
which minimizes $K(\theta)$, i.e. $\tilde\theta$ is the solution of the $r$ equations 

$$B'A_T(\hat\nu - \nu(\theta)) = 0 \tag{3.2}$$

where both $B$ ($= \partial\nu/\partial\theta$) and $\nu$ involve $\theta$. An iterative procedure such as Newton-Raphson 
can be used to solve (3.2). Weighted least squares (WLS) estimates or pseudo mle can be 
used as possible initial choices for $\theta$. We can then compute the min $Q^{(T)}$ estimator of $\nu$ as 

$$\tilde\nu = h^{-1}(X\tilde\theta). \tag{3.3}$$
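
The estimating equations (3.2) can be solved numerically in several ways. The sketch below is one possibility for the logit case, written in plain Python: it uses a Gauss-Newton (scoring) step, which drops the second-derivative terms of a full Newton-Raphson iteration, and it assumes the weight matrix $A_T$ is supplied as a list of lists. The function names, the toy data and the identity weight matrix are illustrative assumptions, not the authors' computations.

```python
import math

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(aug[i][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for i in range(c + 1, n):
            f = aug[i][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[i][j] -= f * aug[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / aug[i][i]
    return x

def min_q_logit(nu_hat, X, A, theta0, iters=50, tol=1e-12):
    """Iterate theta <- theta + (B'AB)^{-1} B'A (nu_hat - nu(theta)) for a logit model."""
    theta = list(theta0)
    I, r = len(X), len(theta)
    for _ in range(iters):
        nu = [expit(sum(x * t for x, t in zip(row, theta))) for row in X]
        # B = d nu / d theta = diag(nu * (1 - nu)) X for the logit link
        B = [[nu[i] * (1.0 - nu[i]) * X[i][k] for k in range(r)] for i in range(I)]
        resid = [nu_hat[i] - nu[i] for i in range(I)]
        AB = [[sum(A[i][j] * B[j][k] for j in range(I)) for k in range(r)] for i in range(I)]
        info = [[sum(B[i][c] * AB[i][d] for i in range(I)) for d in range(r)] for c in range(r)]
        A_resid = [sum(A[i][j] * resid[j] for j in range(I)) for i in range(I)]
        score = [sum(B[i][c] * A_resid[i] for i in range(I)) for c in range(r)]
        step = solve(info, score)
        theta = [t + s for t, s in zip(theta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return theta

# Toy check: estimates that satisfy the model exactly should return the generating theta.
toy_X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
toy_nu_hat = [expit(-1.0), expit(-0.5), expit(0.0)]
toy_A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_theta = min_q_logit(toy_nu_hat, toy_X, toy_A, [0.0, 0.0])
```

In practice the starting value would be a WLS estimate or the pseudo mle, as noted above, and `A` would be built from the retained eigenpairs.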


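The asymptotic behaviours of $\tilde\theta$ and $\tilde\nu$ are given by the following proposition. 


Proposition 3.1 As before, let $\Lambda_T$ denote $(B'A_T B)^{-1}$. We have 

$$\text{(a)} \quad \tilde\theta - \theta \doteq \Lambda_T B'A_T(\hat\nu - \nu(\theta)) \sim MVN(0, \Lambda_T)$$

$$\text{(b)} \quad \tilde\nu - \nu \doteq B\Lambda_T B'A_T(\hat\nu - \nu(\theta)) \sim MVN(0, B\Lambda_T B') \tag{3.4}$$

where "$\doteq$" indicates that the difference between the two sides is negligible in probability. 


The proof follows from the application of the $\delta$-method to the functions $B'A_T(\hat\nu - \nu(\tilde\theta))$ 
and $\tilde\nu - \nu(\theta)$, which gives 

$$B'A_T(\hat\nu - \nu(\theta)) - (B'A_T B)(\tilde\theta - \theta) = o_p(1),$$

$$\tilde\nu - \nu(\theta) - B(\tilde\theta - \theta) = o_p(1).$$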


From the above proposition it follows that the asymptotic covariance matrix of the min $Q^{(T)}$ 
estimator $\tilde\theta$ is the inverse of the information matrix $B'A_T B$ for $\theta$, which was obtained from 
the approximate likelihood of $\mu$ as given by (2.4). It can then be seen that in the absence 
of dimensionality reduction, the estimator $\tilde\theta$ will be asymptotically equivalent to the WLS 
estimator of Koch, Freeman and Freeman (1975). As mentioned in the Introduction, the 
WLS estimator generally shows unstable finite sample behaviour because of the inefficient 
estimation of $\Gamma$. In contrast, the estimator $\tilde\theta$ for a given $\epsilon > 0$ is expected to show stable finite 
sample behaviour in the sense that it can be approximated well by its asymptotic behaviour. 
This is achieved at the cost of compromising the asymptotic optimality of $\tilde\theta$ by restricting 
it to a smaller class, namely the class of estimates based on the first $T$ principal components 
$W$. The WLS estimator, on the other hand, is asymptotically optimal in a wider class, 
namely the class of estimates based on the full data vector $\hat\nu$. If, for a small $\epsilon$, the $Q^{(T)}$ test 
statistic indicates insignificance for $H_0$, then the corresponding min $Q^{(T)}$ estimator $\tilde\theta$ will 
likely provide a robust alternative to the WLS estimator. 


4. MIN $Q^{(T)}$ ESTIMATES OF UNEMPLOYMENT RATES 


The Canadian Labour Force Survey (LFS) data for October 1980 were analysed by Kumar 
and Rao (1984, 1986) and Roberts, Rao and Kumar (1987). Both sets of authors applied 
the extension of the Rao-Scott adjusted $X^2$ method to the case of logistic regression. They 
showed that the logit model given below provided an adequate fit to the survey estimates 
of employment rates ($\nu_{j\ell}$) for the table of 60 cells cross-classified by age (10 categories) and 
education (6 categories). The model is 

$$\log\frac{\nu_{j\ell}}{1 - \nu_{j\ell}} = \beta_0 + \beta_1 A_j + \beta_2 A_j^2 + \beta_3 E_\ell \tag{4.1}$$

where $A_j$ represents the midpoint $12 + 5j$ for the $j$-th age group ($j = 1, \ldots, 10$), and 
$E_\ell$ ($\ell = 1, \ldots, 6$) represents the median years of schooling with values 7, 10, 12, 13, 14 and 16. 

The model (4.1) can be expressed in the notation of Section 2 by numbering the sixty cells 
lexicographically. Thus, (4.1) can be rewritten as $h(\nu) = X\theta$, where $\nu$ is the vector of 
employment rates, $h$ is the logit function, $X$ is a $60 \times 4$ matrix whose $i$-th row is $(1, A_i, A_i^2, E_i)$, 
and $\theta$ is $(\beta_0, \beta_1, \beta_2, \beta_3)'$. We also have 

$$H = (\partial h/\partial\nu) = D_\nu^{-1} D_{1-\nu}^{-1}, \qquad B = H^{-1}X, \tag{4.2}$$

where $D_\nu$ and $D_{1-\nu}$ are diagonal matrices with diagonal elements given by the subscripts. 
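
To make the structure of $X$ and $B$ concrete, the following Python sketch builds the $60 \times 4$ design matrix from the age midpoints and schooling values given above and forms $B = D_\nu D_{1-\nu} X$. It assumes that "lexicographic" numbering means age varies slowest and education fastest; the function names are hypothetical.

```python
def design_matrix():
    """Rows (1, A_j, A_j^2, E_l) for the 10 x 6 age-by-education table."""
    ages = [12 + 5 * j for j in range(1, 11)]   # midpoints 17, 22, ..., 62
    schooling = [7, 10, 12, 13, 14, 16]         # median years of schooling
    return [[1.0, float(a), float(a * a), float(e)]
            for a in ages for e in schooling]   # age outer, education inner

def derivative_matrix(X, nu):
    """B = D_nu D_{1-nu} X, i.e. row i of X scaled by nu_i * (1 - nu_i)."""
    return [[v * (1.0 - v) * x for x in row] for row, v in zip(X, nu)]

X = design_matrix()
B = derivative_matrix(X, [0.5] * len(X))        # illustrative rates of 0.5
```

The scaling by $\nu_i(1 - \nu_i)$ is the derivative of the inverse logit, which is why $B = H^{-1}X$ in (4.2).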
The pseudo mle of $\theta$ for the model (4.1) were obtained by Kumar and Rao (1984) under 
the pseudo product-binomial likelihood as 

$$\hat\theta = (-3.10, 0.217, -0.00218, 0.1509)'. \tag{4.3}$$

They also computed Rao-Scott's first order adjusted $X^2$ ($G^2$ in their notation) as 55.3, which 
shows acceptance of the model (2.1) when referred to the $X^2_{56}$ distribution. 

The $Q^{(T)}$ method was applied for testing (4.1) (see Singh 1985, and Singh and Kumar 
1986), also resulting in the acceptance of the model (4.1). For $\epsilon = .01$, $T$ turns out to be 
51 using the estimated covariance matrix $\hat\Gamma$ as obtained by Kumar and Rao (1984). Now using 
the pseudo mle $\hat\theta$, we have 

$$Q^{(51)}(\hat\theta) = 58.665 - 4.454 = 54.211 \tag{4.4}$$

When $\epsilon = .005$, $T$ is found to be 54, and 

$$Q^{(54)}(\hat\theta) = 67.774 - 2.343 = 65.431 \tag{4.5}$$

When $\epsilon = 0$, $T = 58$ because two cells had zero observed unemployment rates. In this case, 

$$Q^{(58)}(\hat\theta) = 87.302 - 0.812 = 86.49 \tag{4.6}$$

By referring $Q^{(51)}$ to the $X^2_{47}$ distribution, $Q^{(54)}$ to a $X^2_{50}$ and $Q^{(58)}$ to a $X^2_{54}$ distribution, it 
is clear that both $Q^{(51)}$ and $Q^{(54)}$ accept (4.1) while $Q^{(58)}$ does not. An instability check can 
be performed by considering the difference $Q^{(58)} - Q^{(T)}$ for $T = 51, 54$, which can be seen 
to be highly significant when referred to the $X^2_{58-T}$ distribution. These indicate presence of 
the instability problem in the $Q$-test statistic that corresponds to no dimensionality reduction. 
It is clear that the WLS test would also have an instability problem due to the difficulty 
involved in inverting the matrix $\hat\Gamma$, which is singular. Thus, the min $Q^{(T)}$ method would be 
preferable to min $Q$ or WLS methods. In the interests of reducing loss of information, the 
method with the largest value of $T$ is recommended, provided of course that the corresponding 
$Q^{(T)}$ shows insignificance for the model. 
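
The comparisons above can be checked numerically. The sketch below is an illustration under stated assumptions: it takes the reported statistics (4.4) to (4.6), uses degrees of freedom $T - r$ with $r = 4$, assumes a 5% significance level (the level is not stated in the text), and approximates the chi-squared critical values with the Wilson-Hilferty formula rather than exact tables.

```python
import math

Z_95 = 1.6448536269514722   # 95th percentile of the standard normal

def chi2_crit_wh(df, z=Z_95):
    """Wilson-Hilferty approximation to the upper chi-squared critical value."""
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * math.sqrt(c)) ** 3

r = 4                                              # number of model parameters
q_values = {51: 54.211, 54: 65.431, 58: 86.49}     # reported in (4.4)-(4.6)

# Model fit: the model is not rejected when Q^(T) is below the critical value.
fit_ok = {T: q < chi2_crit_wh(T - r) for T, q in q_values.items()}

# Instability check: Q^(58) - Q^(T) referred to chi-squared with 58 - T df.
instability = {T: (q_values[58] - q_values[T]) > chi2_crit_wh(58 - T)
               for T in (51, 54)}
```

Under these assumptions the differences are 32.279 on 7 degrees of freedom and 21.059 on 4 degrees of freedom, both well above the approximate critical values.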

We shall now compute asymptotically efficient estimates. Neither min $Q$ nor WLS estimates 
were computed because $\hat\Gamma$ was singular. The min $Q^{(T)}$ estimates $\tilde\theta$ were computed for 
$\epsilon = .005$ and $\epsilon = .01$ by using the Newton-Raphson iterative procedure and $\hat\theta$ as the initial 
estimate of $\theta$ for solving (3.2). The values of $\tilde\theta_T$ and $Q^{(T)}(\tilde\theta)$ (in this case the negative term 
in (2.6) drops out) for $\epsilon = .005$, $T = 54$ were obtained as 

$$\tilde\theta_{54} = (-2.7112, 0.1944, -0.00196, 0.1432)', \quad \text{and} \quad Q^{(54)}(\tilde\theta_{54}) = 63.4737 \tag{4.7}$$

For $\epsilon = .01$ ($T = 51$), we have 

$$\tilde\theta_{51} = (-2.6739, 0.19702, -0.00202, 0.1364)', \quad \text{and} \quad Q^{(51)}(\tilde\theta_{51}) = 55.2518 \tag{4.8}$$

Conclusions based on the statistic $Q^{(T)}(\tilde\theta)$ for both $T = 54$ and 51 agree with those obtained 
from $Q^{(T)}(\hat\theta)$. 

Table 1 gives efficiencies relative to survey estimates of unemployment rates $1 - \hat\nu$ for 
all cells (except two with zero observed unemployment rates) corresponding to the three 
smoothed estimates. The three smoothed estimates are the pseudo mle, min $Q^{(51)}$ and min 
$Q^{(54)}$. The pseudo mle variances are taken from Kumar and Rao (1986), while those for min 
$Q^{(T)}$ estimates are obtained from the diagonal elements of $B\Lambda_T B'$ of (3.4). As noted by 
Kumar and Rao (1986) for pseudo mle, smoothed estimates based on min $Q^{(T)}$ also lead 
to considerable efficiency gains over survey estimates. The relative trace efficiency of smoothed 
estimates over survey estimates is 17.9 for pseudo mle, 18.95 for min $Q^{(51)}$ and 19.88 for 
min $Q^{(54)}$ estimates. Thus the min $Q^{(T)}$ estimators provide a slight improvement in the 


Table 1 


Efficiencies of Smoothed Estimates of Unemployment Rates 
relative to Survey Estimates* 


Cell Number  Min $Q^{(51)}$  Min $Q^{(54)}$  Pseudo mle | Cell Number  Min $Q^{(51)}$  Min $Q^{(54)}$  Pseudo mle 


1 Soll! 5.74 5.44 Si 9.01 2 8.65 
%4 3.62 B02 3.28 32 8.76 9.46 10.68 

3 3.45 3559 3512 33 36.93 42.93 oH) 
4 52.45 51.65 43.46 34 51555 60.23 S112 
5 104.77 114.30 96.21 55 69.76 To 93 9537 
7 5.33 5.14 4.38 36 aly 11.01 15.07 
8 9.36 9:53 8.09 37 3.48 3.01 3.45 
9 6.85 7.16 6.70 38 13.74 15.91 18.00 
10 25.65 28.40 26.31 39 66.87 80.98 97.30 
11 13.34 14.13 h7c73 40 154.81 187.73 221-50 
12 27.74 30.85 30.85 4] 49.14 67.56 80.61 
13 8.64 8.84 yi Ge 42 732 PANIES 24.98 
14 13.84 13.84 12237 43 8.57 LES 8.49 
LS 8.20 8.49 9.47 44 27.42 31.65 30.74 
16 23.14 24.09 Pad peli) 45 50.55 70.67 City 
17 18.20 18.20 21.49 46 94.11 114.13 121.49 
18 9.87 11.14 12.51 47 S27 12 112.65 108.52 
19 15.87 16.03 13.66 48 26.54 39.41 41.22 
20 11.44 11.98 12.56 49 4.95 Sor 4.41 
an 12739 12539 15.53 50 12 14.10 Li. 17 
22 24.83 24.83 32.02 51 6.75 8.61 7.50 
23 16.43 18.16 Z1NSS 52 8.83 11.45 9.90 
24 6.98 7.83 10.06 53 52.64 71.49 61.14 
fas) 7.49 7.74 6.99 ap) 3.59 ane B 3.03 
26 10.33 L133 12532 56 qed3 8.96 8.23 
OH) 6.47 7.18 8.69 ey) 23.50 29.83 Zo 
28 125.81 140.57 VRE 58 PIA P 294.59 208.77 
29 33.88 38.13 52.00 ey) 6.45 8.82 6.62 
30 14.89 15.24 20.43 60 38.90 52.84 41.96 


* Cells 6 and 54 are omitted due to zero observed unemployment rates. 

efficiency of smoothed estimates compared to pseudo mle. With regard to performance over 
individual cells, Table 1 indicates that the pseudo mle behave very well as compared to 
efficient min $Q^{(T)}$ estimates for the example under consideration. 


5. CONCLUDING REMARKS 


For computing pseudo mle, the working form of the likelihood function corresponds to 
simple random samples (i.e. multinomial or product-multinomial sampling). The pseudo mle 
do provide consistent estimates of model parameters without requiring an estimate of the 
covariance matrix $\Gamma$. However, the pseudo mle are not asymptotically efficient for complex 
survey data. By contrast, the min $Q^{(T)}$ estimates are asymptotically efficient with respect 
to the class of estimates based on $W$ (the first $T$ principal components of the vector $\hat\nu$ of 
survey estimates). For investigating the relative performance of pseudo mle and min $Q^{(T)}$, 
it would be desirable to perform a simulation study for efficiency comparisons. The min 
$Q^{(T)}$ estimates do take account of the underlying complex design by employing an 
appropriate $\hat\Gamma$. If $\hat\Gamma$ is not ill-conditioned, i.e. it has no relatively very small eigenvalues, then 
there is no instability problem with the well known WLS estimates, which are of course 
asymptotically efficient. In this case, it will usually turn out that there is no dimensionality 
reduction for small $\epsilon$, that $T$ will coincide with $I$ and that there will be no loss in efficiency of 
min $Q^{(T)}$ estimates in comparison with WLS estimates. However, given the instability 
problem common with cross-classified categorical survey data, the min $Q^{(T)}$ estimates are 
expected to provide a robust alternative to WLS estimates. 


ACKNOWLEDGEMENT 


The second author’s research was supported by Statistics Canada and the Natural Sciences 
and Engineering Research Council of Canada. 


REFERENCES 


COX, D.R., and HINKLEY, D.V. (1974). Theoretical Statistics. London: Chapman and Hall. 


FAY, R.E. (1985). A jackknifed chi-squared test for complex samples. Journal of the American Statistical 
Association, 80, 148-157. 


IMREY, P.B., KOCH, G.G., and STOKES, M.E. (1982). Categorical data analysis: Some reflections 
on the log-linear model and logistic regression. Part II: Data analysis. International Statistical Review, 
50, 35-63. 


KOCH, G.G., FREEMAN, D.H. Jr., and FREEMAN, J.L. (1975). Strategies in the multivariate 
analysis of data from complex surveys. International Statistical Review, 43, 59-78. 


KUMAR, S., and RAO, J.N.K. (1984). Logistic regression analysis of Labour Force Survey Data. Survey 
Methodology, 10, 62-81. 


KUMAR, S., and RAO, J.N.K. (1986). On smoothed estimates of unemployment rates from labour 
force survey data. In Small Area Statistics: An International Symposium '85 (Eds. R. Platek, and 
M.P. Singh), Ottawa: Carleton University. 


NEYMAN, J. (1949). Contribution to the Theory of the $X^2$ test. In Proceedings of the First Berkeley 
Symposium on Mathematical Statistics and Probability (Ed. J. Neyman), Berkeley: University of 
California Press, 230-273. 




RAO, J.N.K., and SCOTT, A.J. (1981). The analysis of categorical data from complex sample surveys: 
chi-squared tests for goodness of fit and independence in two way tables. Journal of the American 
Statistical Association, 76, 221-230. 


RAO, J.N.K., and SCOTT, A.J. (1984). On chi-squared tests for multiway contingency tables with 
cell proportions estimated from survey data. Annals of Statistics, 12, 46-60. 


ROBERTS, G.R. (1985). Contributions to chi-squared tests with survey data. Ph.D. dissertation, 
Carleton University, Ottawa. 


ROBERTS, G., RAO, J.N.K., and KUMAR, S. (1987). Logistic regression analysis of sample survey 
data. Biometrika, 74, 1-12. 


SINGH, A.C. (1985). On optimal asymptotic tests for analysis of categorical data from sample surveys. 
Working Paper, Social Survey Methods Division, Statistics Canada. 


SINGH, A.C., and KUMAR, S. (1986). Categorical data analysis for complex surveys. Proceedings 
of the Section on Survey Research Methods, American Statistical Association, (forthcoming). 


Survey Methodology, June 1987 85 
Vol. 13, No. 1, pp. 85-92 
Statistics Canada 


A Sampling Procedure with Inclusion 
Probabilities Proportional to Size 


A. DEY and A.K. SRIVASTAVA¹ 


ABSTRACT 


A new unequal probability sampling scheme for selecting n (> 2) units without replacement from a 
finite population is proposed. This scheme ensures that the inclusion probabilities are proportional 
to sizes. It has the advantage of simplicity in selection and estimation and also provides a non-negative 
variance estimator. The variance of the Horvitz-Thompson (H-T) estimator under the proposed scheme 
is shown to be smaller than that of the customary estimator in probability proportional to size sampl- 
ing with replacement. The proposed scheme also compares favourably with the without replacement 
scheme suggested by Sampford (1967) in an empirical study on a few natural populations. 


KEY WORDS: Unequal probability sampling; Horvitz-Thompson estimator. 


1. INTRODUCTION 


In unequal probability sampling of $n$ units without replacement from a finite population 
containing $N$ units, if $\pi_i$ denotes the inclusion probability of the $i$-th unit in the sample, 
$i = 1, 2, \ldots, N$, the Horvitz and Thompson (1952) estimator (H-T estimator) of $Y$, the 
population total of the study variable $y$, is given by 

$$\hat{Y} = \sum_{i \in s} (y_i/\pi_i), \tag{1.1}$$

where $y_i$ is the $y$-value for the $i$-th unit and the summation extends over the units included 
in the sample. The variance of $\hat{Y}$ is 

$$V(\hat{Y}) = \sum_{i=1}^{N} \sum_{j>i} (\pi_i \pi_j - \pi_{ij}) (y_i/\pi_i - y_j/\pi_j)^2 \tag{1.2}$$

where $\pi_{ij}$ denotes the joint inclusion probability of the $i$-th and $j$-th units in the sample 
($i \ne j = 1, 2, \ldots, N$). 
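
Equations (1.1) and (1.2) translate directly into code. The following Python sketch is an illustration with hypothetical function names; the example design is simple random sampling of 2 units from 3 without replacement, chosen only because its inclusion probabilities are easy to write down.

```python
from itertools import combinations

def ht_estimate(sample_y, sample_pi):
    """Horvitz-Thompson estimate (1.1): sum of y_i / pi_i over sampled units."""
    return sum(y / p for y, p in zip(sample_y, sample_pi))

def ht_variance(y, pi, pi_joint):
    """Variance (1.2); pi_joint[(i, j)] holds pi_ij for each pair i < j."""
    total = 0.0
    for i, j in combinations(range(len(y)), 2):
        diff = y[i] / pi[i] - y[j] / pi[j]
        total += (pi[i] * pi[j] - pi_joint[(i, j)]) * diff ** 2
    return total

# Illustrative design: SRS of n = 2 from N = 3, so pi_i = 2/3 and pi_ij = 1/3.
y_pop = [1.0, 2.0, 3.0]
pi_pop = [2 / 3] * 3
pi_pairs = {(0, 1): 1 / 3, (0, 2): 1 / 3, (1, 2): 1 / 3}
```

The variance vanishes when every ratio $y_i/\pi_i$ is the same, which is the motivation for making $\pi_i$ proportional to a size measure that is nearly proportional to $y_i$.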

Considerable reduction in the variance of $\hat{Y}$ can be expected if the sampling scheme 
ensures that $\pi_i$ are proportional to a given measure of size, say, $x_i$ for $i = 1, 2, \ldots, N$, where 
it is assumed that $x_i$ are nearly proportional to $y_i$. Sampling schemes in which $\pi_i$ are 
proportional to $x_i$ are termed Inclusion Probability Proportional to Size (IPPS) schemes. For 
a comprehensive account of unequal probability sampling procedures, including IPPS 
sampling schemes, the reader is referred to the monograph of Brewer and Hanif (1983). 

Some desirable properties of an unequal probability scheme without replacement in general, 
and IPPS schemes in particular, are simplicity in selection and estimation, availability of 
a non-negative variance estimator, and better efficiency than with the probability propor- 
tional to size (PPS) with replacement strategy. Unfortunately, for sample size greater than 
two, not many of the available procedures meet these requirements fully. 


¹ A. Dey and A.K. Srivastava, Indian Agricultural Statistics Research Institute, Library Avenue, New Delhi 110012, 
India. 

In this paper, an IPPS sampling scheme is suggested for arbitrary sample sizes, n > 2. 
The procedure is rather simple both in sample selection and at the estimation stage since 
compact expressions for $\pi_{ij}$ are available. It has also been possible to provide a positive 
estimator of variance of the H-T estimator of $Y$. The performance of the H-T estimator 
under the proposed scheme is compared with the PPS with replacement strategy and a 
simple sufficient condition is derived under which the performance of the former strategy is 
superior to that of the latter. An empirical study on a few natural populations indicates that 
the proposed scheme compares favourably with that suggested by Sampford (1967). 


2. THE SAMPLING PROCEDURE 


Consider a population of N units with y as the study variable and x, an auxiliary variable, 
as the size. It is assumed that x-values are known for all the population units. A sample of 
size n( > 2) is to be selected. To start with, it is assumed that n is even. 

Divide the population into $m$ ($> n/2$) groups such that the $i$-th group contains $N_i$ ($> 2$) 
units ($i = 1, 2, \ldots, m$) and, for each group, 

$$X_i/X \ge (n - 2)/[n(m - 1)], \tag{2.1}$$

where 

$$X_i = \sum_{u=1}^{N_i} x_{i_u},$$

$x_{i_u}$ is the value of $x$ for the $u$-th unit in the $i$-th group and $X = X_1 + X_2 + \ldots + X_m$. 


Equation (2.1) is satisfied if the $X_i$ ($i = 1, 2, \ldots, m$) are made nearly equal. It has been 
seen in actual populations, considered by Rao and Bayless (1969) and others, that this 
condition is satisfied for quite a few values of $m$ if the groups are so formed that their sizes, 
$X_i$, are nearly equal. Rao and Lanke (1984) suggested a grouping procedure in which $N$ units 
are grouped into $R$ groups such that group totals, $X_i$, are nearly equal and group sizes are 
either $[N/R]$ or $[N/R] + 1$, where $[x]$ is the largest integer contained in $x$. For the 
formation of groups, the Rao-Lanke procedure may also be tried. 

Having formed the m groups, the suggested sampling procedure consists of the following 
steps: 


Step 1. Select $n/2$ groups out of the $m$ groups using Midzuno’s (1951) sampling procedure
with probabilities $\{P_i'\}$, that is, select one group with probability

$$P_i' = [n(m - 1)P_i - (n - 2)] / (2m - n), \qquad P_i = X_i / X, \quad i = 1, 2, \ldots, m,$$

and the remaining $(n/2) - 1$ groups with equal probabilities without replacement.


Step 2. From each of the selected groups, select two units by any IPPS procedure, say by
Durbin’s (1967) procedure, that is, in the $i$-th selected group ($i = 1, 2, \ldots, n/2$)
select one unit with probability

$$p_{i_u|i} = x_{i_u} / X_i,$$

and the second unit with revised probability

$$p_{i_v|i_u} = x_{i_v} \left[ 1/(X_i - 2x_{i_u}) + 1/(X_i - 2x_{i_v}) \right] / D_i,$$

where

$$D_i = 1 + \sum_{u=1}^{N_i} x_{i_u} / (X_i - 2x_{i_u}).$$
For this sampling procedure, the inclusion probability for the $i_u$-th unit is evidently given by

$$\pi_{i_u} = n p_{i_u}, \qquad (2.2)$$

where

$$p_{i_u} = x_{i_u} / X.$$
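Also, the joint inclusion probabilities for a pair of units are given by

$$\pi_{i_u i_v} = \frac{n\, p_{i_u} p_{i_v}}{D_i} \left[ \frac{1}{P_i - 2p_{i_u}} + \frac{1}{P_i - 2p_{i_v}} \right], \quad u \neq v, \qquad (2.3)$$

and

$$\pi_{i_u j_v} = \frac{n(n - 2)\, p_{i_u} p_{j_v}\, [(m - 1)(P_i + P_j) - 1]}{(m - 1)(m - 2)\, P_i P_j}, \quad i \neq j = 1, 2, \ldots, m. \qquad (2.4)$$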


Thus we see that the proposed scheme is indeed an IPPS scheme. 
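
As a numerical check of the IPPS property (2.2), the following Python sketch computes exact unit inclusion probabilities for the two-step procedure. It is an illustration under stated assumptions, not the authors' code: the function names and the small population are hypothetical, and the revised first-draw probability is taken to be $P_i' = [n(m - 1)P_i - (n - 2)]/(2m - n)$, the value that makes the group inclusion probability under Midzuno's procedure equal to $nP_i/2$.

```python
from fractions import Fraction


def revised_group_probs(P, n):
    """First-draw probabilities P'_i for Midzuno's procedure (assumed form)."""
    m = len(P)
    return [(n * (m - 1) * p - (n - 2)) / (2 * m - n) for p in P]


def midzuno_inclusion_probs(P_rev, n):
    """Group inclusion probabilities: one group drawn with probability P'_i,
    the other n/2 - 1 groups drawn with equal probabilities without replacement."""
    m, k = len(P_rev), n // 2
    rest = Fraction(k - 1, m - 1)
    return [q + (1 - q) * rest for q in P_rev]


def durbin_inclusion_probs(x):
    """Within-group inclusion probabilities when two units are drawn by
    Durbin's procedure; obtained by summing over all ordered pairs."""
    X = sum(x)
    p = [xi / X for xi in x]
    D = 1 + sum(pi / (1 - 2 * pi) for pi in p)
    incl = [Fraction(0) for _ in x]
    for u, pu in enumerate(p):
        for v, pv in enumerate(p):
            if u == v:
                continue
            second = pv * (1 / (1 - 2 * pu) + 1 / (1 - 2 * pv)) / D
            incl[u] += pu * second  # unit u drawn first, unit v second
            incl[v] += pu * second
    return incl


def unit_inclusion_probs(groups, n):
    """Exact inclusion probability of every unit under the two-step scheme."""
    groups = [[Fraction(v) for v in g] for g in groups]
    X = sum(sum(g) for g in groups)
    P = [sum(g) / X for g in groups]
    m = len(groups)
    if any(p <= Fraction(n - 2, n * (m - 1)) for p in P):
        raise ValueError("grouping violates condition (2.1)")
    pi_group = midzuno_inclusion_probs(revised_group_probs(P, n), n)
    return [[pg * pu for pu in durbin_inclusion_probs(g)]
            for pg, g in zip(pi_group, groups)]


# Hypothetical population: m = 3 groups of sizes x, sample size n = 4.
example_groups = [[3, 4, 5], [2, 5, 6], [4, 4, 4]]
example_pi = unit_inclusion_probs(example_groups, 4)
```

Exact rational arithmetic is used so that each computed probability can be compared with $n x_{i_u}/X$ without rounding error. Durbin's procedure requires every unit in a group to satisfy $x_{i_u} < X_i/2$, which the example groups do.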

As mentioned earlier, at step 2 of the proposed procedure, any IPPS scheme for selecting 
two units can be used. Since the procedure of Durbin (1967), which is equivalent to those 
of Rao (1963) and Brewer (1963), generally performs well, it has been adopted at step 2. 


3. A VARIANCE ESTIMATOR 


Two well-known unbiased estimators of $\mathrm{Var}(\hat{Y})$ are due to Horvitz and Thompson (1952)
and Yates and Grundy (1953). Both these estimators, however, suffer from the drawback 
that they sometimes assume negative values. In this section, a positive estimator of variance 
is proposed that utilizes the two-stage nature of the proposed sampling scheme. 

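Using a result due to Des Raj (1966), an unbiased estimator of $\mathrm{Var}(\hat{Y})$ is given by

$$v(\hat{Y}) = \sum_{i=1}^{n/2} \frac{1}{\pi_i} \sum_{u<v} \frac{\pi_{i_u|i}\, \pi_{i_v|i} - \pi_{i_u i_v|i}}{\pi_{i_u i_v|i}} \left( \frac{y_{i_u}}{\pi_{i_u|i}} - \frac{y_{i_v}}{\pi_{i_v|i}} \right)^2 + \sum_{i<j}^{n/2} \frac{\pi_i \pi_j - \pi_{ij}}{\pi_{ij}} \left( \frac{\hat{Y}_i}{\pi_i} - \frac{\hat{Y}_j}{\pi_j} \right)^2, \qquad (3.1)$$

where

$$\pi_i = n P_i / 2, \qquad \pi_{ij} = \frac{n(n - 2)\, [(m - 1)(P_i + P_j) - 1]}{4(m - 1)(m - 2)},$$

$$\pi_{i_u|i} = 2 p_{i_u} / P_i,$$

$$\pi_{i_u i_v|i} = \frac{4\, p_{i_u} p_{i_v} (P_i - p_{i_u} - p_{i_v})}{D_i P_i (P_i - 2p_{i_u})(P_i - 2p_{i_v})},$$

$$\text{and} \quad \hat{Y}_i = \sum_{u=1}^{2} y_{i_u} / \pi_{i_u|i}, \qquad (3.2)$$

$y_{i_u}$ being the $y$-value of the $u$-th unit in the $i$-th group.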


The two terms in the right side of (3.1) correspond to the Yates-Grundy variance estimator 
in Durbin’s and Midzuno’s procedures. Since under these two sampling procedures the Yates- 
Grundy estimator of variance is always positive, it follows that the variance estimator given 
by (3.1) is also positive. However, the estimator in (3.1) is neither the Horvitz-Thompson 
nor the Yates-Grundy variance estimator. 


4. COMPARISON WITH PPS WITH REPLACEMENT STRATEGY


In this section, we compare the efficiencies of the following two strategies: 


Strategy 1. The proposed sampling scheme in conjunction with the Horvitz-Thompson 
estimator. 


Strategy 2. PPS sampling with replacement in conjunction with the customary estimator. 
Strategy 1 is more efficient than Strategy 2 if and only if


Ni 


m 
Lu YY ty Wi,/Piy — YO%/P, — Y 


uy 
After some lengthy but routine algebra, the inequality (4.1) boils down to

where $Y_i = \sum_u y_{i_u}$.
Obviously, (4.2) holds if
(i) $(2n - m - 2) > 0$, and

(ii) $P_i > (n - 2) / [(m - 1)(2n - m - 2)]$. $\qquad$ (4.3)


Also, since we are using Midzuno’s procedure at the first stage with revised probabilities
$\{P_i'\}$, each $P_i$ must satisfy (2.1), that is, each $P_i$ must satisfy

$$P_i > (n - 2) / [n(m - 1)].$$


Thus, (4.2) holds if

$$m \leq (n - 2). \qquad (4.4)$$

It appears, therefore, that for Strategy 1 to be superior to Strategy 2, $m$ should be chosen
such that

$$n/2 < m \leq (n - 2). \qquad (4.5)$$


However, it is clear that (4.4) is merely a sufficient condition and is not necessary. For
n > 6, condition (4.5) offers a somewhat wide choice for the value of m, while for n = 6,
(4.5) implies that m = 3. For n = 4, (4.5) does not lead to a feasible value of m. Therefore,
for n = 4, an investigation into the performance of Strategy 1 has been taken up for various
values of m, not constrained by (4.5), on certain natural populations. A description of the
populations appears in Table 1. Table 2 presents the relative efficiency of Strategy 1 com- 
pared to Strategy 2 for the populations in Table 1. The performance of the H-T estimator 
under Sampford’s (1967) scheme (called Strategy 3) is also compared with that of Strategy 2. 

It can be observed from Table 2 that the performance of the proposed strategy (Strategy 
1) compares favourably with that of Sampford (Strategy 3) for most of the populations. Of 
course, both strategies are superior to Strategy 2. 

To achieve the relative efficiency of Strategy 1, the units were grouped in an ad-hoc man- 
ner, ensuring only that requirement (2.1) was satisfied. The procedure of Rao and Lanke 
(1984) was also attempted in forming the groups. However, the Rao-Lanke procedure did 
not always result in a high efficiency. Further investigations are necessary to decide the ‘best’ 
choice of groups. For certain populations, suitable groups satisfying (2.1) could not be formed 
for higher values of m, and thus, for these cases, the relative efficiencies are not reported 
in Table 2. 

In conclusion, a brief comment on cases in which the desired sample size, n, is odd is
in order. An IPPS sample for odd n may be obtained by selecting (n + 1) units by the suggested
procedure and then randomly discarding one unit. The expressions for $\pi_i$ and $\pi_{ij}$
under this procedure are straightforward. Obviously, when one of the sample units out of
(n + 1) is discarded at random, the resulting sample consists of two units from each of
(n - 1)/2 groups and just one unit from one of the groups. An unbiased and positive
estimator of $\mathrm{Var}(\hat{Y})$ can be obtained, analogous to (3.1), on the basis of the (n - 1)/2
groups, each containing two units in the sample.
Table 1

Description of the Populations

Pop.     Source                                                y                              x
Number

 1.      Des Raj (1965)                                        Number of households           Eye-estimated number of households
 2.      Rao (1963)                                            Corn acreage in 1960           Corn acreage in 1958
 3.      Cochran (1963, p. 204)                                Weight of peaches              Eye-estimated weight of peaches
 4.      Hanurav (1967)                                        Population in 1967             Population in 1957
 5.      Hanurav (1967)                                        Population in 1967             Population in 1957
 6.      Hanurav (1967)                                        Population in 1967             Population in 1957
 7.      Hanurav (1967)                                        Population in 1967             Population in 1957
 8.      Cochran (1963, p. 225)                                Number of persons per block    Number of rooms per block
 9.      Cochran (1963, p. 156, cities 1-16)                   Population in 1930             Population in 1920
10.      Cochran (1963, p. 156, cities 33-49)                  Population in 1930             Population in 1920
11.      Sampford (1962, p. 61)                                Oats acreage in 1957           Oats acreage in 1947
12.      Sukhatme and Sukhatme (1970, p. 256, circles 1-20)    Wheat acreage                  Number of villages
13.      Sukhatme and Sukhatme (1970, p. 256, circles 21-40)   Wheat acreage                  Number of villages
14.      Yates (1960, p. 163)                                  Volume of timber               Eye-estimated volume of timber

N (values legible in the source, in column order; four entries are missing): 20, 14, 10, 20, 19, 17, 35, 20, 20, 20
Table 2

Percent Relative Efficiencies of
Strategies 1 and 3 over Strategy 2 for the
Populations in Table 1 (n = 4)

Pop.      Strategy 1                                Strategy 3
Number    m = 3     m = 4     m = 5     m = 6

 1.       130.1     118.7     120.8     124.5       127.8
 2.       132.6     130.2     -         -           127.1
 3.       149.1     -         -         -           147.9
 4.       120.7     120.6     ?         129.7       117.8
 5.       129.1     138.7     158.7     -           125.1
 6.       158.0     ?         -         -           139.5
 7.       151.9     144.8     169.2     -           131.9
 8.       168.5     -         -         -           145.5
 9.       118.3     116.3     -         -           109.5
10.       126.6     -         -         -           123.2
11.       113.8     ?         135.6     129.9       113.8
12.       117.4     128.0     119.0     -           119.3
13.       122.2     120.6     -         -           119.7
14.       124.8     123.1     115.4     113.2       116.3

(- = not reported; ? = value illegible in the source)
Sample Design for the Health and Activity
Limitation Survey


D. DOLSON, K. McCLEAN, J.-P. MORIN, and A. THÉBERGE¹


ABSTRACT 


The Health and Activity Limitation Survey is part of the program to establish a data base on the disabled 
population in Canada. The sample design used for the part of the survey covering the population not 
living in institutions is described. In addition, the methods used to determine the sizes of the samples 
and to select the samples are presented. 


KEY WORDS: Disability; Stratified sampling; Two-stage sampling; Optimum allocation; Sampling 
without replacement. 


1. INTRODUCTION 


As part of the program to obtain more information about Canada’s disabled population, 
the Health and Activity Limitation Survey (HALS) was conducted in the fall of 1986. It is 
designed to obtain information concerning the nature of the problems experienced by that 
population and, in general, their daily activities (at home, at work, at school, during travel, 
and so on). The survey is divided into two parts: one covers the population living in institu- 
tions and the other, which is the subject of this article, covers the non-institutional population. 

Canada has been divided into 238 subprovincial areas (SPAs). All Quebec and Ontario 
municipalities with more than 125,000 residents and all municipalities in the other provinces 
with more than 75,000 residents are included as SPA’s. The other areas are made up of 
groups of census subdivisions respecting geographical contiguity and the provincial bound- 
aries. The number of these areas in each province is proportional to the square root of the 
population, minus the previously defined municipalities. One of the main objectives of the 
survey is to generate statistics on the disabled population at the SPA level so that the popula- 
tion’s various needs can be analysed in detail. In addition, estimates will be produced for 
three age groups - namely, children (under 15 years of age), adults (15 to 64 years of age) 
and seniors (65 years of age and older). 

The data was collected in two stages. The first stage involved a multipart question (question 
20) included on form 2B of the 1986 Canadian Census of Population. This question asked 
about the respondents’ limitations in various types of activities and their own assessments 
of their conditions. A copy of question 20 is given in the Appendix. The second stage was 
implemented some time after the census. It involves a screening questionnaire and follow-up 
to collect information on the problems and activities of disabled respondents. 

The main purpose of the first stage is to separate respondents into two groups: those who 
answered ‘‘yes’’ to at least one part of question 20 and those who answered ‘‘no’’ to all 
parts. The aim is to identify beforehand a large part of the potential disabled population, 
in order to focus survey resources on the target group. However, previous surveys have shown 
that this question will not identify the entire target population. (See Dolson et al. 1984 and
Dolson et al. 1986.)


¹ D. Dolson, K. McClean, J.-P. Morin, and A. Théberge, Social Survey Methods Division, Statistics Canada, Ottawa,
Ontario, K1A 0T6.
The second stage is HALS. Personal interviews are conducted for the ‘‘yes’’ stratum and
telephone interviews are conducted for the ‘‘no’’ stratum. From an operational point of view, 
the interviews are in two parts — the screening questionnaire and the follow-up. 

The screening questionnaire is designed to identify respondents for whom the follow-up 
questionnaire is relevant. The questionnaire for adults covers the seventeen activities of daily 
living (ADLs) used in the Canadian Health and Disability Survey in 1983 and 1984, repeats 
Part (a) of question 20 from the Census, and includes a few questions on mental illness and 
handicaps (see the Appendix). If an affirmative answer is given to at least one of these ques- 
tions, the interviewer proceeds with the follow-up; if not, the interview is terminated. Part 
(a) of the Census question is asked again because there may have been a change in status, 
either because the response in the Census was given by a proxy, or because the respondent 
has reassessed his or her own condition. 

The screening section in the questionnaire for children includes questions on special aids, 
activity limitations, attendance at a special school and health conditions or problems. A ‘‘yes”’ 
answer to at least one of these questions prompts a follow-up interview. The Census question 
is not repeated because all interviews regarding children require a proxy and the question 
on activity limitations is equivalent to Part (a) of Census question 20. 

The second section of this article describes how the population of Canada has been 
divided into various subpopulations for estimation purposes. The third section covers the 
HALS sample design. The fourth section deals with the file of geographic information and 
projected demographic data for 1986 that was used to create the survey frame. The fifth 
section explains how the sampling was done. 


2. POPULATIONS COVERED 


Permanent residents of general and psychiatric hospitals, special care centres or institu- 
tions for the elderly or chronically ill, institutions for the physically handicapped and 
orphanages or children’s homes are the subject of a distinct part of the survey - namely, 
HALS (Institutions). This article will look at the part of the survey covering that portion of 
the Canadian population not covered by HALS (Institutions) and not residing in jails, military 
camps, young offender facilities, naval vessels, penal or correctional institutions and collec- 
tive dwellings in the ‘‘others’’ category (for example, circuses and non-religious communes). 

Each enumeration area (EA) whose population is not totally excluded from the survey 
is classified in one of the following five survey frames: 


1. Indian reserves where the 1981 Census was conducted using canvassers;
2. Other Indian reserves;
3. Canvasser EAs;
4. EAs in the Whitehorse, Yellowknife, Pine Point, Hay River and Fort Smith SPAs;
5. All other EAs.


The order of priority for belonging to a frame is 1-2-4-3-5. This means that an EA that 
is an Indian reserve and situated in the Whitehorse SPA is classified as an Indian reserve. 
Each EA is divided in two, with the ‘‘yes’’ EA made up of those persons who would answer
‘‘yes’’ to the Census question, and the ‘‘no’’ EA made up of those who would answer ‘‘no’’
to it. A different sample design is used for each of the five survey frames: all of the ‘‘yes’’
EAs and none of the ‘‘no’’ EAs are selected in the first frame; all of the ‘‘yes’’ EAs and
a sample of the ‘‘no’’ EAs are selected in the second frame; none of the ‘‘no’’ EAs and
a sample of the ‘‘yes’’ EAs are selected in the third frame; all of the EAs are selected in
the fourth frame; and a sample of the ‘‘yes’’ EAs and a sample of the ‘‘no’’ EAs are selected
in the fifth frame. 
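
The priority rule and the per-frame selection rules can be written down compactly. The Python sketch below is only an illustration: the class, field names and rule labels are hypothetical, and the only facts taken from the text are the priority order 1-2-4-3-5 and the treatment of ‘‘yes’’ and ‘‘no’’ EAs in each frame.

```python
from dataclasses import dataclass

# Order in which frame membership is tested (from the text: 1-2-4-3-5).
PRIORITY = (1, 2, 4, 3, 5)

# (treatment of "yes" EAs, treatment of "no" EAs) for each frame.
SELECTION_RULE = {
    1: ("all", "none"),
    2: ("all", "sample"),
    3: ("sample", "none"),
    4: ("all", "all"),
    5: ("sample", "sample"),
}


@dataclass
class EnumerationArea:
    """Hypothetical flags describing which frame definitions an EA meets."""
    reserve_canvassed_1981: bool = False
    other_reserve: bool = False
    canvasser_ea: bool = False
    in_northern_spa: bool = False  # Whitehorse, Yellowknife, Pine Point, ...


def survey_frame(ea):
    """Return the frame number of the first definition the EA meets."""
    qualifies = {
        1: ea.reserve_canvassed_1981,
        2: ea.other_reserve,
        3: ea.canvasser_ea,
        4: ea.in_northern_spa,
        5: True,  # "all other EAs" catches everything else
    }
    for frame in PRIORITY:
        if qualifies[frame]:
            return frame
```

For instance, an EA flagged both as an Indian reserve and as lying in the Whitehorse SPA is assigned to the reserve frame, as the text describes.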


3. SURVEY DESIGN


The sampling method presented in this section was used for survey frames three and five. 
Because our space is limited, the sample design used for the second survey frame will not 
be described in this article. (For more information on the HALS methodology, see Dolson 
et al. 1986.) 


3.1 Sample Design 


Each province is divided into subprovincial areas (SPAs), which are themselves divided 
into enumeration areas (EAs). 

Each EA is divided into a ‘‘yes’’ EA and a ‘‘no’’ EA, the first containing those persons 
who would answer ‘‘yes’’ to Census question 20, the second containing those persons who 
would answer ‘‘no’’ to that question. In each SPA, the ‘‘yes’’ EAs are stratified into large 
and small EAs on the basis of the criterion explained in the fourth section of this paper. 
Persons belonging to a ‘‘yes’’ EA are associated with a stratum and an SPA in addition to 
their EA, while persons belonging to a ‘‘no’’ EA are associated only with their EA. In each 
province, the population is subdivided into three age groups: children (under 15 years of 
age), adults (15 to 64 years of age) and seniors (65 years of age and older). 

The sampling method involves using a two-stage stratified sample design for the ‘‘yes”’ 
EAs in each SPA and a two-stage sample design for the ‘‘no’’ EAs in the province. The 
primary units are the EAs and the secondary units are the respondents. 

All persons who completed Census form 2B in a ‘‘yes’’ EA selected for the sample are 
interviewed, along with a third of those in the ‘‘no’’ EAs selected. 


3.2 Sample Allocation 


This sample design must allow us to minimize sampling costs for a given maximum coefficient
of variation of the estimates and a given variance for the estimator $\hat{B}$ of the relative
bias $B$. We define $B$ as the ratio of the number of ‘‘no’’ persons with a characteristic of
interest in the province, $\tau_0$, to the number of ‘‘yes’’ persons with a characteristic of interest
in the province, $\tau_1$. By ‘‘no’’ person, we mean an individual who would answer ‘‘no’’ to
all parts of Census question 20, and by ‘‘yes’’ person, an individual who would answer ‘‘yes’’
to at least one part of the question.


Figure 1. Illustration of Sample Design.
Let $N_0$ be the number of ‘‘no’’ EAs in the province; $N_{jk}$, the number of ‘‘yes’’ EAs in
stratum $j$ and SPA $k$ in the province; $n_0$ and $n_{jk}$, the corresponding sample sizes; and $c_0$
and $c_{jk}$, the corresponding unit sampling costs. If we have $N_p$ SPAs in the province, we
therefore want to minimize

$$\sum_{k=1}^{N_p} (c_{1k} n_{1k} + c_{2k} n_{2k}) + c_0 n_0$$

given

$$CV(\hat{y}_k) \leq CV^*(\hat{y}_k), \qquad \mathrm{Var}(\hat{B}) \leq \mathrm{Var}^*(\hat{B}),$$

$$n_{1k} \leq N_{1k}, \qquad n_{2k} = \lambda_k n_{1k}, \qquad n_0 \leq N_0 \qquad (k = 1, 2, \ldots, N_p),$$

where $\lambda_k$ is the ratio of the expected number of disabled persons in the small EAs to the
expected number of disabled persons in the large EAs of SPA $k$, $\hat{y}_k$ is the estimated number
of ‘‘yes’’ persons who have a characteristic of interest in SPA $k$, and values marked with
an asterisk are constants.

If the sampling fraction in the ‘‘yes’’ EAs is $f_1$, $M_{ijk}$ is the number of ‘‘yes’’ persons in
EA $i$ of stratum $j$ of SPA $k$ in the province and $p_{ijk}$ is the probability of a characteristic
of interest for a ‘‘yes’’ person in EA $i$ of stratum $j$ in SPA $k$, then


After a few algebraic manipulations, we obtain

We can therefore write $CV^2(\hat{y}_k)$ as


Var (Ye) _ Ak 


CVG, = 
1G Nik 


EV Bus (3.1) 


Furthermore, $B$ (the relative bias) and $\hat{B}$ (its estimator) are given by

$$B = \frac{\tau_0}{\tau_1}, \qquad \tau_0 = \sum_{i=1}^{N_0} M_{i0}\, p_{i0}, \qquad \hat{B} = \frac{\hat{\tau}_0}{\hat{\tau}_1},$$

where $M_{i0}$ is the number of ‘‘no’’ persons in EA $i$ in the province and $p_{i0}$ is the probability
of a characteristic of interest for a ‘‘no’’ person in EA $i$.
Assuming that $\hat{\tau}_0$ and $\hat{\tau}_1$ are independent, then

$$\mathrm{Var}(\hat{B}) \approx B^2 \left[ \frac{\mathrm{Var}(\hat{\tau}_0)}{\tau_0^2} + \frac{\mathrm{Var}(\hat{\tau}_1)}{\tau_1^2} \right]. \qquad (3.2)$$


After a few algebraic manipulations, if $f_0$ is the sampling fraction in the ‘‘no’’ EAs, we
obtain

Furthermore, assuming that the $\hat{y}_k$’s are independent, we have

Using equation (3.1), this expression can be written as

The optimization problem can be re-expressed as the problem of minimizing

In practice, rather than using $b_k = \min(N_{1k}, N_{2k}/\lambda_k)$ we define

while, if $\lambda_k n_k > N_{2k}$, sample sizes are given by $n_{2k} = N_{2k}$ and


$$n_{1k} = n_k + (\lambda_k n_k - N_{2k}) \frac{\lambda_k N_{1k}}{N_{2k}}.$$

Thus, we consider $N_{2k}/(N_{1k}\lambda_k)$ small EAs to be equivalent to one large EA. On average,
there are as many disabled persons in one large EA as in $N_{2k}/(N_{1k}\lambda_k)$ small EAs.

Proceeding in this way, it is not always true that $n_{2k} = \lambda_k n_{1k}$. However, we avoid CVs
higher than target values, when, for example, small EAs remain to be observed (even if all
the large EAs have been selected).

For some values of $k$, it is possible that $a_k \geq b_k$. If this is the case, we set $n_k = b_k$. Let

$E_1 = \{k = 0, 1, 2, \ldots, N_p \mid n_k = a_k\}$,

$E_2 = \{k = 0, 1, 2, \ldots, N_p \mid n_k = b_k > a_k\}$,

$E_3 = \{k = 0, 1, 2, \ldots, N_p \mid a_k < n_k < b_k\}$,

$E_4 = \{k = 0, 1, 2, \ldots, N_p \mid b_k \leq a_k\}$.


The solution exists if 
What are the sets $E_1$, $E_2$, $E_3$ and $E_4$ corresponding to the solution? Set $E_4$ is easy to
determine. We must have


dy < (dy) 7K < by (KeE3), (dele) “K =; (Keb), 


(di. /¢,) "Kid, ahek, ye (3.9) 


Determining the sets involves trying each of the possibilities for E,, E, and F; until a 
value for k which satisfies (3.9) is obtained. To reduce the number of possibilities to be ex- 
amined, note that, if for k’ =k, 


bi (ch (dt)? = Dy (Cel ay) (ke e{0, 1... .Np}), (3.10) 


then there is a K* such that E, = {0,1,2,...,k*}, or Ey = { }, while, if for k’ =k, 


Gide di) A= Gree doe mie Kae alsa: Nollie (3.11) 


then there is a K** such that E, = (k**, k** + 1,...N,} or E, = { }-. 


3.3 Parameter Estimation 
To calculate the optimum sample allocation, the following quantities must be determined: 
P, = proportion of HALS screened-in individuals who replied ‘‘yes’’ to Census question 20, 


P, = proportion of HALS screened-out individuals who replied ‘‘yes’’ to Census question 
20, and 


P; = proportion of HALS screened-in individuals who replied ‘‘no’’ to Census question 20. 


Since these parameters cannot be computed directly using data from the Canadian Health 
and Disability Survey, a test called the ‘‘calibration study’’ was carried out in September 
and October 1985. 

Census question 20 was included, without abbreviation, as a supplementary question in 
the September Labour Force Survey (LFS). It was asked to a sample of approximately 36,000 
individuals. The questions on the 17 ADLs and a question on mental handicaps were added 
as a supplement to the October LFS and were asked of the same individuals. 

For each five-year age group, the weighted values from the calibration study were used 
to estimate the probability of an affirmative response, P (yes), to Census question 20. The 
HALS screening questionnaire differs from that used in the calibration study. In HALS, 
there are more questions on mental and psychological problems and part (a) of Census ques- 
tion 20 is asked again. Therefore, we did not depend on the calibration study alone to calculate 
the parameters. 
4. 1986 GEOGRAPHIC AND DEMOGRAPHIC FILE


4.1 Description of Available Information 


When the sample allocation was done in the spring of 1986, the following information 
was available for use in calculation of population projections by age group and EA: 


1. population projections by age group and province in 1986;
2. estimated population by age group and CD in 1984;
3. population by age group and EA in 1981;
4. conversion file to establish the correspondence between the 1981 and 1986 EAs;
5. estimated numbers of dwellings by EA in 1986.


The conversion file is structured according to the concept of equivalent sets. Each equivalent 
set is the smallest region consisting of EAs that has not had its boundaries altered. For ex- 
ample, if three 1981 EAs were reorganized as two 1986 EAs, the group of three 1981 EAs 
(or the group of two 1986 EAs) is an equivalent set. 

The four methods described in the next subsection are designed to produce population 
projections by age group and by equivalent set in 1986. If an equivalent set is made up of 
several 1986 EAs, the projected population for the equivalent set can be divided propor- 
tionally among the EAs using the estimated numbers of dwellings by EA in 1986. 
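
The proportional division mentioned above amounts to a one-line apportionment. The sketch below is a minimal illustration with a hypothetical function name and made-up numbers; it assumes only that each 1986 EA receives a share of the projected population of the equivalent set equal to its share of the estimated dwellings.

```python
def split_by_dwellings(projected_population, dwellings_by_ea):
    """Divide the projected population of an equivalent set among its
    1986 EAs in proportion to the estimated number of dwellings per EA."""
    total = sum(dwellings_by_ea.values())
    if total <= 0:
        raise ValueError("total number of dwellings must be positive")
    return {ea: projected_population * d / total
            for ea, d in dwellings_by_ea.items()}


# Hypothetical equivalent set made up of two 1986 EAs.
shares = split_by_dwellings(900.0, {"EA-A": 100, "EA-B": 200})
```

The same call would be applied separately to each age group if age-specific projections by EA are needed.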


4.2 Estimation Methods 
For province $p$, let
$ES_{l,k}$ = the $l$-th equivalent set of the $k$-th CD ($l = 1, 2, \ldots, N_k$; $k = 1, 2, \ldots, N_p$),
$ES_{l,k;81}(j)$ = population of $ES_{l,k}$ in the $j$-th age group in 1981 ($j = 1, 2, \ldots, 16$),
$CD_{k;84}(j)$ = estimated population of the $k$-th CD in the $j$-th age group in 1984,
$P_{86}(j)$ = projected population in the $j$-th age group in the province in 1986.
For the three methods that follow, the first step is to calculate $\widehat{CD}_{k;86}(j)$, the projected
population of the $j$-th age group in the $k$-th CD in 1986. We assume there exists $K_j$
($j = 1, 2, \ldots, 16$) such that

$$\widehat{CD}_{k;86}(j) = K_j\, CD_{k;84}(j) \qquad (k = 1, 2, \ldots, N_p;\ j = 1, 2, \ldots, 16),$$

$$\sum_{k=1}^{N_p} \widehat{CD}_{k;86}(j) = P_{86}(j) \qquad (j = 1, 2, \ldots, 16).$$

This implies that

$$\widehat{CD}_{k;86}(j) = \frac{P_{86}(j)\, CD_{k;84}(j)}{\sum_{k=1}^{N_p} CD_{k;84}(j)}.$$
The first method of estimating $\widehat{ES}_{l,k;86}(j)$ involves assuming the existence of $K_j$ ($j = 1,
\ldots, 16$) such that

$$\widehat{ES}_{l,k;86}(j) = K_j\, ES_{l,k;81}(j) \qquad (l = 1, \ldots, N_k;\ j = 1, \ldots, 16),$$

$$\sum_{l=1}^{N_k} \widehat{ES}_{l,k;86}(j) = \widehat{CD}_{k;86}(j) \qquad (j = 1, 2, \ldots, 16).$$

We will say that this method uses the simple model. We obtain

$$\widehat{ES}_{l,k;86}(j) = \frac{\widehat{CD}_{k;86}(j)\, ES_{l,k;81}(j)}{\sum_{l=1}^{N_k} ES_{l,k;81}(j)} \qquad (l = 1, \ldots, N_k;\ j = 1, \ldots, 16).$$

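The two ratio adjustments behind the simple model (province to CD, then CD to equivalent set) can be sketched as follows. This is an illustrative Python rendering with hypothetical function names and data layout (nested lists indexed by CD or equivalent set, then by age group); returning zero when a denominator is zero is an assumption of the sketch, not something stated in the text.

```python
def project_cd(P86, CD84):
    """Scale 1984 CD estimates so that, for each age group j, the projected
    1986 CD populations add up to the provincial projection P86[j].
    CD84[k][j] is the 1984 estimate for CD k and age group j."""
    n_age = len(P86)
    totals = [sum(cd[j] for cd in CD84) for j in range(n_age)]
    return [[P86[j] * cd[j] / totals[j] if totals[j] else 0.0
             for j in range(n_age)]
            for cd in CD84]


def project_es_simple(CD86_k, ES81_k):
    """Simple model for one CD: scale the 1981 equivalent-set populations
    ES81_k[l][j] so that, for each age group j, they add up to CD86_k[j]."""
    n_age = len(CD86_k)
    totals = [sum(es[j] for es in ES81_k) for j in range(n_age)]
    return [[CD86_k[j] * es[j] / totals[j] if totals[j] else 0.0
             for j in range(n_age)]
            for es in ES81_k]


# Hypothetical province with two CDs and two age groups.
cd86 = project_cd([110.0, 220.0], [[40, 100], [60, 100]])
es86 = project_es_simple(cd86[0], [[10, 30], [30, 20]])
```

Summing a row of the second result over age groups gives the estimated total population of that equivalent set under the simple model.
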

With this simple model, the estimated total population of $ES_{l,k}$ in 1986 is

$$\sum_{j=1}^{16} \frac{\widehat{CD}_{k;86}(j)\, ES_{l,k;81}(j)}{\sum_{l=1}^{N_k} ES_{l,k;81}(j)}.$$


If one thinks that a better estimate, $\widehat{ES}_{l,k;86}(\mathrm{tot})$, of this quantity can be produced by
independent means (for example, using the estimated number of dwellings in $ES_{l,k}$ in 1986),
then more elaborate models can be used to estimate $\widehat{ES}_{l,k;86}(j)$. The multiplicative model
is specified by the following equations:


ES) x36) = Kj (ES;x:31V)) + Goo (SPR p abe lO: 


Nk 
Vy ES; 4864) = K(CDegsVU))  G = 1, «--, 16), 
l= 

Nk 

Cy Or 

=| 
16 ye se 
Se ES) 7.36) = ESpieg (fot) CU SH 1. 5 Ne). 


f= 


One can interpret e, as the net intra-CD migration for the /-th equivalent set. 
The third model, called the additive model, is given by the following equations:

$$\widehat{ES}_{l,k;86}(j) = ES_{l,k;81}(j) + e_l + f_j \qquad (l = 1, \ldots, N_k;\ j = 1, 2, \ldots, 16),$$

$$\sum_{l=1}^{N_k} \widehat{ES}_{l,k;86}(j) = \widehat{CD}_{k;86}(j) \qquad (j = 1, 2, \ldots, 16),$$

$$\sum_{j=1}^{16} \widehat{ES}_{l,k;86}(j) = \widehat{ES}_{l,k;86}(\mathrm{tot}) \qquad (l = 1, \ldots, N_k).$$


This model involves the assumption that the population increases (or decreases) for each
age group in each of the equivalent sets in a CD can be decomposed into two terms - one
which depends only on the equivalent set and not on age ($e_l$), and one which depends only
on age and not on the equivalent set ($f_j$).

A final trivial model involves simply formulating

$$\widehat{ES}_{l,k;86}(j) = ES_{l,k;81}(j) \qquad (l = 1, \ldots, N_k;\ j = 1, 2, \ldots, 16).$$


4.3 Evaluation of Estimation Methods 


The four methods were evaluated using data for the period 1976-1981. We used the 1976
projection of the population by age group and province in 1981, $P_{81}(j)$, the population
by age group and EA in 1976, a 1976-1981 conversion file and the pre-Census estimate of
the number of dwellings per EA in 1981. Since there are no estimates for population by age
group and CD in 1979 (the equivalent of $CD_{k;84}(j)$), we set


A PMC) CDEaC 

CDy.31 = CNA ne Sa 
yy CDx:84V) 
K=1 


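For $\widehat{ES}_{l,k;81}(\mathrm{tot})$, which is needed for the multiplicative and additive models, we used

$$\widehat{ES}_{l,k;81}(\mathrm{tot}) = \frac{\sum_{j=1}^{16} ES_{l,k;76}(j) \cdot \sum_{j=1}^{16} \widehat{CD}_{k;81}(j)}{\sum_{l=1}^{N_k} \sum_{j=1}^{16} ES_{l,k;76}(j)}.$$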
Table 1
Comparison of the Four Methods

Prov.       EFF_S     EFF_M     EFF_A
Nfld.       0.890     0.891     0.887
P.E.I.      0.903     0.914     0.919
N.S.        0.960     0.972     0.912
N.B.        0.870     0.868     0.884
Que.        0.778     0.764     0.818
Ont.        0.932     0.930     0.916
Man.        0.892     0.904     0.912
Sask.       0.732     0.749     0.801
Alta.       0.818     0.827     0.860
B.C.        0.?13     0.716     0.775
Yukon       0.770     0.768     0.840
N.W.T.      1.252     1.246     ?

(? = digit or value illegible in the source)


For each province $p$, an efficiency measure was calculated for the simple, multiplicative and
additive models relative to the trivial model:

$$EFF_m = \frac{\sum_{k} \sum_{l} \sum_{j} \left( \widehat{ES}^{(m)}_{l,k;81}(j) - ES_{l,k;81}(j) \right)^2}{\sum_{k} \sum_{l} \sum_{j} \left( \widehat{ES}^{(T)}_{l,k;81}(j) - ES_{l,k;81}(j) \right)^2} \qquad (m = S, M, A),$$

where $\widehat{ES}^{(m)}_{l,k;81}(j)$ with $m$ = S, M, A and T are the projections obtained by means of the
simple, multiplicative, additive and trivial models respectively. Some values obtained are
given in Table 1.

The simple model gives the worst results for one province and one territory, the multi- 
plicative model for two provinces and the additive model for seven provinces and one territory. 

The simple model is the best for five provinces, while the multiplicative model is best for two 
provinces and one territory and the additive model is best for three provinces and one territory. 

Since the simple model also has, as its name implies, the advantage of simplicity, it is 
the one that was chosen. 


4.4 Method of Stratification by Enumeration Area Size 


If simple random sampling were used to select EAs within each subprovincial area (SPA), 
disabled persons belonging to an EA with many disabled residents would have less chance 
of being selected than those in a small EA - that is, an EA with few disabled persons. To 
avoid excessive differences in selection probabilities, the population of EAs in each SPA 
is stratified according to the number of disabled persons in the EAs, and then proportional 
allocation is used. With proportional allocation, the number of EAs selected is proportional 
to the number of disabled persons for each stratum. 
Using the results of earlier surveys, a link was established between the age distribution
of the population of an EA and the number of disabled persons expected in the EA. Since 
the number of disabled persons is unknown, the variable used for stratification and sample 
allocation is the expected number of disabled persons. 

In the case under consideration here, there are only two strata - one for large EAs and 
one for small EAs. Since proportional allocation is being used, we employed a criterion found 
in Raj (1968) to determine the optimum dividing line between large and small EAs. This 
criterion gives the optimum dividing line as the average of the average size of the small EAs 
and the average size of the large EAs. 
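
Because the dividing line depends on the two stratum averages, which in turn depend on the dividing line, it has to be found iteratively. The Python sketch below shows one simple way to do this; the fixed-point iteration, the starting value and the function name are assumptions made for illustration, since the text states only the criterion itself.

```python
def dividing_line(sizes, tol=1e-9, max_iter=100):
    """Find a cut point equal to the average of the mean size of the small
    EAs (size <= cut) and the mean size of the large EAs (size > cut)."""
    sizes = list(sizes)
    cut = sum(sizes) / len(sizes)  # start from the overall mean (assumption)
    for _ in range(max_iter):
        small = [s for s in sizes if s <= cut]
        large = [s for s in sizes if s > cut]
        if not small or not large:
            return cut  # one stratum is empty; no further refinement possible
        new_cut = (sum(small) / len(small) + sum(large) / len(large)) / 2
        if abs(new_cut - cut) <= tol:
            return new_cut
        cut = new_cut
    return cut


# Hypothetical expected numbers of disabled persons per EA in one SPA.
boundary = dividing_line([1, 2, 3, 10, 11, 12])
```

In this sketch the sizes would be the expected numbers of disabled persons per EA, the stratification variable described above.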


5. SAMPLE SELECTION 


It was necessary to draw samples for the three populations (children, adults and seniors) 
among the large and small ‘‘yes’’ EAs of each SPA, both for frame three and for frame 
five, and among the ‘‘no’’ EAs of each province for frame five. When an SPA contained 
fewer than two large EAs or fewer than two small EAs, we selected all of the EAs in that 
SPA for the three populations. The ‘‘yes’’ and ‘‘no’’ samples were created independently, 
using the one-pass algorithm described by Bebbington (1975). The samples from the three 
populations for the ‘‘yes’’ and ‘‘no’’ components were nested to minimize the total number 
of EAs selected. 
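
A one-pass sequential selection of the kind attributed to Bebbington (1975) can be sketched as follows. The code is a generic illustration with a hypothetical function name, not the production routine used for HALS, and it does not attempt to reproduce the nesting of the three samples. Each unit is examined once and is selected with probability equal to the number of units still needed divided by the number of units not yet examined.

```python
import random


def one_pass_sample(items, n, rng=None):
    """Draw a simple random sample of n items without replacement in a
    single pass over the list."""
    rng = rng or random.Random()
    items = list(items)
    N = len(items)
    if not 0 <= n <= N:
        raise ValueError("sample size must be between 0 and the population size")
    sample = []
    needed = n
    for seen, item in enumerate(items):
        remaining = N - seen  # units not yet examined, including this one
        # Select with probability needed / remaining.
        if rng.random() * remaining < needed:
            sample.append(item)
            needed -= 1
            if needed == 0:
                break
    return sample


# Hypothetical use: select 4 EAs out of 10, with a fixed seed for repeatability.
selected = one_pass_sample(range(10), 4, random.Random(0))
```

The procedure always returns exactly n units, because the selection probability reaches one as soon as the number of units still needed equals the number of units left.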

The following table shows the sizes obtained for the samples by province for each age 
group. 


Table 2
Sample Sizes by Province and Age Group

             Children                    Adults                      Seniors
Province     ‘‘yes’’ EAs   ‘‘no’’ EAs    ‘‘yes’’ EAs   ‘‘no’’ EAs    ‘‘yes’’ EAs   ‘‘no’’ EAs
             selected      selected      selected      selected      selected      selected

Nfld.        880           136           405           154           476           173
P.E.I.       242           242           111           ?             82            166
N.S.         1257          157           434           130           438           115
N.B.         1142          162           459           146           453           138
Que.         4749          153           1070          114           1488          133
Ont.         6085          158           1304          116           1542          120
Man.         1082          203           457           169           367           144
Sask.        2291          265           942           241           921           193
Alta.        2762          190           909           176           1389          ?
B.C.         3117          170           752           125           948           119

(? = value illegible in the source)
6. DISCUSSION 


The postcensal survey is a relatively new survey method that will no doubt undergo extensive 
development in the next few years. This type of survey allows for a great deal of flexibility in 
data collection and use of large samples scattered throughout the country, with reasonable costs 
and timeliness. The Health and Activity Limitation Survey is the first postcensal survey of its 
size in Canada. 

The sample design presented in this article is an attempt to maximize use of the opportunities 
offered by the postcensal approach, with optimum use of the available resources. One of the 
major problems inherent in the proposed method is control of sample size. Sample allocation 
is determined before the census is taken; this means that all calculations must be done using pro- 
jections based on the previous census. In this context, the actual size of a sample made up of 
a group of small areas selected on the basis of the projection results may vary considerably from 
its expected size. 

Therefore, on the one hand, one may obtain a sample that is inadequate with respect to the 
quality requirements for the estimates. On the other hand, the resources allocated to data 
collection may be exceeded. In order to prevent these problems, we implemented the following 
strategy. A target number of interviews for each population was calculated for the ‘‘yes’’ sample. 
This number was based on the sample size required to produce estimates that would satisfy our 
quality criteria. However, for the reasons mentioned above, we selected more EAs than were 
necessary to obtain the target number of interviews. For reasons of cost, if the real number of 
interviews to be conducted, as calculated in the field, was higher than the target number, a 
subsample of EAs was excluded from the survey. Only for the Halifax Regional Office (covering 
Prince Edward Island, Nova Scotia and New Brunswick) was the number of interviews in the 
‘‘yes’’ sample substantially higher than the target number. The decision was therefore made to 
exclude certain EAs from this part of the sample. In order to know which EAs would be excluded, 
it was necessary to know the target number and the real number of interviews for each EA. For 
40 per cent of the EAs, the real number of interviews had to be imputed since this information 
was not available in time. 

For this imputation, the total real number of interviews was known for each census 
commissioner district. The portion of this total not already allocated to EAs with known numbers 
of interviews was distributed among the EAs requiring imputation, in proportion to the target 
number of interviews. 
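
As a minimal sketch of this proportional imputation (the function name `impute_interviews`, the EA labels and the counts are hypothetical), the unallocated part of the district total is shared among the EAs without field counts in proportion to their targets:

```python
def impute_interviews(district_total, known, targets):
    """Share the part of a district's total not yet allocated to EAs with known
    counts among the remaining EAs, in proportion to their target numbers.

    known   -- {EA id: real number of interviews} for EAs with field counts
    targets -- {EA id: target number of interviews} for EAs needing imputation
    """
    remainder = district_total - sum(known.values())
    total_target = sum(targets.values())
    if total_target == 0:
        raise ValueError("targets must sum to a positive number")
    return {ea: remainder * t / total_target for ea, t in targets.items()}


# Hypothetical district: 100 interviews in all, 50 of them in EAs A and B
imputed = impute_interviews(100, {"A": 30, "B": 20}, {"C": 10, "D": 30})
```

The imputed values are generally not integers; whether and how they were rounded is not stated in the text.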

We then calculated, for each population, the difference between the real number and the target 
number of interviews for each of the two strata of each SPA. A positive difference (real-target) 
indicated a population for which some EAs could be excluded from the survey. In each stratum, 
the EAs were divided into three groups (1, 2 and 3), in accordance with whether they had been 
selected for three, two or only one of the populations respectively. The EA file was then sorted 
by stratum and by group in ascending order, with the order of the EAs within each group being 
random. Each EA was considered successively and was suppressed for the three populations if: 


1) a positive difference remained non-negative after suppression of the EA; 
2) a negative difference was not further reduced. 


In this way, each positive difference was reduced to a number as close as possible to zero, 
considering the random order of the EAs. 
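
The suppression pass for one stratum can be sketched as follows. This is one reading of the two conditions, not the production procedure: an EA is dropped only if, for every population, removing its interviews keeps a positive difference non-negative and takes nothing away from a population whose difference is already zero or negative. The data layout and the names `suppress_eas`, `group` and `interviews` are assumptions made for the illustration.

```python
import random


def suppress_eas(eas, diffs, rng=None):
    """Greedy suppression of EAs within one stratum.

    eas   -- list of dicts: {'id', 'group' (1, 2 or 3), 'interviews': {population: count}}
    diffs -- {population: real minus target number of interviews} for the stratum
    Returns the ids of the suppressed EAs and the remaining differences.
    """
    rng = rng or random.Random()
    remaining = dict(diffs)
    order = list(eas)
    rng.shuffle(order)                      # random order within each group ...
    order.sort(key=lambda ea: ea["group"])  # ... kept by the stable sort on group
    dropped = []
    for ea in order:
        counts = ea["interviews"]
        acceptable = True
        for pop, diff in remaining.items():
            n = counts.get(pop, 0)
            if diff > 0 and diff - n < 0:   # condition 1 would be violated
                acceptable = False
            if diff <= 0 and n > 0:         # condition 2 would be violated
                acceptable = False
        if acceptable:
            for pop in remaining:
                remaining[pop] -= counts.get(pop, 0)
            dropped.append(ea["id"])
    return dropped, remaining


# Hypothetical stratum with one EA in each group
example_eas = [
    {"id": "EA1", "group": 1, "interviews": {"children": 2, "adults": 2, "seniors": 1}},
    {"id": "EA2", "group": 2, "interviews": {"children": 3, "adults": 2}},
    {"id": "EA3", "group": 3, "interviews": {"children": 3}},
]
example_diffs = {"children": 5, "adults": 3, "seniors": -2}
dropped, left = suppress_eas(example_eas, example_diffs, rng=random.Random(0))
```

In this example only EA2 is dropped: EA1 would further reduce the seniors sample, and EA3 would push the children difference below zero once EA2 has gone.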
APPENDIX 


Question 20 of Census Form 2B 


20. a) Are you limited in the kind or amount of activity that you can do because of a 
       long-term physical condition, mental condition or health problem: (See Guide) 

       At home? 
       [ ] No, I am not limited 
       [ ] Yes, I am limited 

       At school or at work? 
       [ ] No, I am not limited 
       [ ] Yes, I am limited 
       [ ] Not applicable 

       In other activities, e.g., transportation to or from work, leisure time activities? 
       [ ] No, I am not limited 
       [ ] Yes, I am limited 

    b) Do you have any long-term disabilities or handicaps? 
       [ ] No 
       [ ] Yes 


Screening Questions for HALS (Questionnaire for Adults) 


Do you have any trouble hearing what is said in a normal conversation with one other 
person? 

Do you have any trouble hearing what is said in a group conversation with at least three 
other people? 

Do you have any trouble reading ordinary newsprint, with glasses if normally worn? 

Do you have any trouble seeing clearly the face of someone from 12 feet/4 metres 
(example: across a room), with glasses if normally worn? 

Do you have any trouble speaking and being understood? 

Do you have any trouble walking 400 yards/400 metres without resting (about three 
city blocks)? 

Do you have any trouble walking up and down a flight of stairs (about 12 steps)? 

Do you have any trouble carrying an object of 10 pounds for 30 feet/5 kg for 10 metres 
(example: carrying a bag of groceries)? 

Do you have any trouble moving from one room to another? 

Do you have any trouble standing for long periods of time, that is, more than 20 minutes? 

Remember, I am asking about problems expected to last 6 months or more. 

When standing do you have any trouble bending down and picking up an object from 
the floor (example: a shoe)? 

Do you have any trouble dressing and undressing yourself? 

Do you have any trouble getting in and out of bed? 

Do you have any trouble cutting your own toenails? 
17. Do you have any trouble using your fingers to grasp or handle? 

18. Do you have any trouble reaching in any direction (example: above your head)? 

19. Do you have any trouble cutting your own food? 

20. Because of a long-term physical condition or health problem, that is, one that is expected 
to last 6 months or more, are you limited in the kind or amount of activity you can do 
(i) at home? (ii) at school or at work? (iii) in other activities such as travel, sports, or 
leisure? 

21. Has a school or health professional ever told you that you have a learning disability? 

22. From time to time, everyone has trouble remembering the name of a familiar person, 
or learning something new, or they experience moments of confusion. However, do you 
have any ongoing problems with your ability to remember or learn? 

23. Because of a long-term emotional, psychological, nervous, or mental health condition 
or problem, are you limited in the kind or amount of activity you can do? 
(i) at home? (ii) at school or at work? (iii) in other activities such as travel, sports, or 
leisure? 
Comparison of Estimators of Population Total 
in Two-Stage Successive Sampling 
Using Auxiliary Information 


F.C. OKAFOR¹ 


ABSTRACT 


Singh and Srivastava (1973) proposed a linear unbiased estimator of the population mean when 
sampling on successive occasions using several auxiliary variables whose known population means 
remain unchanged for all occasions. In this paper, three composite estimators $T_1$, $T_2$ and $T_3$, 
each utilising an auxiliary variable whose known population mean changes from one occasion to the 
next, are presented for the estimation of the current population total. The proposed estimators are 
compared with the ordinary estimator, $T_0$, and the usual successive sampling estimator, $T'$, of 
the current population total without the use of auxiliary information. We find that using auxiliary 
information in conjunction with successive sampling does not uniformly produce a gain in efficiency 
over $T_0$ or $T'$. However, when applied to a survey of teak plantations to estimate the mean 
height of teak trees, $T_1$, $T_2$ and $T_3$ proved more efficient than $T_0$ and $T'$. 


KEY WORDS: Successive occasion; Partial matching; Auxiliary variate. 


1. INTRODUCTION 


The theory and practice of surveying the same population at different points in time, 
technically called repetitive sampling or sampling over successive occasions, have been given 
considerable attention by some survey statisticians. The main objective of sampling on 
successive occasions is to estimate some population parameters (total, mean, ratio, etc.) for the 
most recent occasion as well as changes in these parameters from one occasion to the next. 

The theory of successive sampling was initiated by Jessen (1942). Many authors have since 
contributed, especially in the estimation of population means. Among them are Singh (1968), 
Abraham et al (1969), Kathuria and Singh (1971), and Kathuria (1975), to mention but a few. 

Singh (1968) was the first to extend the theory of unistage sampling to two-stage sampling 
on successive occasions. He considered the sampling scheme in which, on the second occasion, 
a fraction $\lambda$ of the first stage units (FSUs) selected on the previous occasion is retained, 
along with their selected second stage units (SSUs), and a fraction $\mu$ ($\lambda + \mu = 1$) is 
selected afresh. He then obtained a minimum variance unbiased estimator of the population mean 
on the current occasion. 

Abraham et al (1969) considered the situation in which partial matching of units was car- 
ried out at both stages. Units were selected by simple random sampling without replacement 
(SRSWOR). Kathuria (1975) modified this by using probability proportional to size and with 
replacement (PPSWR) for selection of the FSUs, and proposed a linear composite estimator 
for the population mean on the current occasion. 


¹ F.C. Okafor, Department of Statistics, University of Ibadan, Ibadan, Nigeria. 

When an auxiliary variable is highly correlated with the characteristic under study, the 
estimate of the population mean (total) of this characteristic can be improved using the aux- 
iliary variable. Singh and Srivastava (1973) used auxiliary information to improve on the 
estimator of Singh (1968). They obtained a linear unbiased estimator of the population mean 
on the most recent occasion using several auxiliary variables whose population means are 
known and are the same for all occasions. Kathuria (1978) developed this study further by 
assuming that the population mean of the auxiliary variate is not known. He used a double 
sampling technique to estimate first the population mean of the auxiliary variate and then 
the mean of the characteristic under study. 

In their contributions, Singh and Srivastava (1973) and Kathuria (1978) assumed that the 
necessary information on the auxiliary variables can be obtained from the respondents or 
reporting units (SSUs). This is not generally the case. It may happen that the information 
on the auxiliary variable is too distorted to be useful because of the sensitive nature of the 
question, or the respondents may refuse outright to supply any information. Alternatively, 
the information on the auxiliary variate may not be collected because the required question 
is not included in the questionnaire. 

Singh and Srivastava also assumed that the known population total of the auxiliary variable 
is the same for all occasions. This may not be true in practice. If the population total of 
the main characteristic changes from one occasion to the next, there is every likelihood that 
the population total of any other variable correlated with it will also vary. 

In this paper three composite estimators of the population total using auxiliary informa- 
tion and a two-stage successive sampling scheme are proposed. The performances of the three 
estimators are compared empirically and they are also applied to a survey of teak planta- 
tions to estimate the mean height of teak trees. 


2. SAMPLING FOR TWO OCCASIONS 


For all three proposed estimators, we assume that the population total of the auxiliary 
variable changes on the second occasion. 

The estimators of the population total (mean) based on the partial matching scheme are 
better than the ordinary estimators of the population total (mean) without partial matching. 
Therefore, it is expected that the proposed estimators $T_1$, $T_2$ and $T_3$ will perform better than 
the ordinary population total estimator, $T_0$, and the estimator based on the partial matching 
scheme without the use of auxiliary information, $T'$. 

In deriving these estimators, we assume that: 


(i) the sample size is constant on each occasion; 
(ii) the normed size measure $P_i$ for the $i$-th first stage unit (FSU) is fixed for each 
occasion; 
(iii) $N$ and $M_i$, population sizes for the FSUs and the second stage units (SSUs) within 
the $i$-th FSU respectively, are constant for the two occasions; 
(iv) the population total (mean) of the auxiliary variate is known. 


Assumptions (i) to (iii) apply to $T'$, $T_1$, $T_2$ and $T_3$; (iv) applies to $T_1$, $T_2$ and $T_3$, but 
not to $T_0$ and $T'$. 

On the first occasion, a sample $S_1$ of $n$ FSUs is selected with probability proportional 
to size and with replacement (PPSWR) using $P_i$ as normed size measure for the $i$-th 
($i = 1, 2, \ldots, N$) unit. For the selection of SSUs, we adopt the method due to Cochran 
(1977, p. 306), which stipulates that if the $i$-th FSU in $S_1$ is drawn $\theta_i$ times ($i = 1, 2, \ldots, n$), 
we select $\theta_i$ independent subsamples of size $m_i$ from the $M_i$ SSUs. 

On the second occasion, we select a sample of $\lambda n$ ($0 < \lambda < 1$) FSUs from $S_1$ by simple 
random sampling without replacement (SRSWOR). The SSUs selected on the first occasion 
are retained for each of these $\lambda n$ matched FSUs. Then, a fresh sample of $\mu n$ ($\mu = 1 - \lambda$) 
FSUs is selected independently from the $N$ FSUs by PPSWR, with $P_i$ as normed size measure 
for the $i$-th FSU. In each of the $\mu n$ FSUs, the SSUs are selected as on the first occasion. 
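
The first-stage selection on the two occasions can be sketched as below. This is an illustration under stated assumptions, not the author's program: the function name `select_two_occasions` and the size measures are hypothetical, the matched subsample is drawn from the $n$ first-occasion draws (so a unit drawn twice can be retained twice), and second-stage subsampling is left out.

```python
import random


def select_two_occasions(sizes, n, lam, seed=0):
    """First-stage selection for two occasions with partial matching.

    sizes -- size measure of each of the N FSUs (P_i is sizes[i] / sum(sizes))
    n     -- number of FSU draws on each occasion
    lam   -- matched fraction, 0 < lam < 1
    Returns (first-occasion draws, matched draws, fresh second-occasion draws).
    """
    rng = random.Random(seed)
    units = list(range(len(sizes)))
    # Occasion 1: n draws with probability proportional to size, with replacement
    s1 = rng.choices(units, weights=sizes, k=n)
    # Occasion 2, matched part: simple random sample without replacement of the draws
    n_matched = round(lam * n)
    matched = rng.sample(s1, n_matched)
    # Occasion 2, unmatched part: fresh PPSWR draws from all N FSUs
    fresh = rng.choices(units, weights=sizes, k=n - n_matched)
    return s1, matched, fresh


# Hypothetical frame of 16 FSUs, 8 draws per occasion, 50% matching
s1, matched, fresh = select_two_occasions([2000, 800, 400, 250] * 4, n=8, lam=0.5, seed=1)
```

Under Cochran's rule quoted above, an FSU that appears several times among the draws would receive that many independent subsamples of SSUs.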


3. NOTATION 


We define $y_{ij}$ ($x_{ij}$) as the value of the study variate for the $j$-th SSU in the $i$-th FSU on the 
current (previous) occasion. In addition, $z_{hij}$ is defined as the value of the auxiliary variate 
for the $j$-th SSU in the $i$-th FSU on the $h$-th occasion ($h = 1, 2$). The sample means for SSUs 
in the $i$-th FSU are 

$$\bar{y}_i = \frac{1}{m_i} \sum_{j=1}^{m_i} y_{ij}, \quad \bar{x}_i = \frac{1}{m_i} \sum_{j=1}^{m_i} x_{ij} \quad \text{and} \quad \bar{z}_{hi} = \frac{1}{m_i} \sum_{j=1}^{m_i} z_{hij} .$$

The population total for the $i$-th FSU and the overall population total for the auxiliary variate 
are 

$$Z_{hi} = \sum_{j=1}^{M_i} z_{hij} \quad \text{and} \quad Z_h = \sum_{i=1}^{N} Z_{hi} .$$


We define additional notation as follows: 

$$S_b^2(y) = \sum_{i=1}^{N} P_i \left( \frac{Y_i}{P_i} - Y \right)^2$$ 

is the between-FSU variance; 

$$S_w^2(y) = \sum_{i=1}^{N} \frac{M_i^2}{P_i} \left( \frac{1}{m_i} - \frac{1}{M_i} \right) S_{wi}^2(y)$$ 

is the variance among SSUs within the FSUs; 

$$S_{wi}^2(y) = \frac{1}{M_i - 1} \sum_{j=1}^{M_i} (y_{ij} - \bar{Y}_i)^2$$ 

is the variance among the SSUs in the $i$-th FSU; 

$S^2(y) = S_b^2(y) + S_w^2(y)$; 

$C_b(x,y) = \rho_b S_b(x) S_b(y)$ is the between-FSU covariance of $x$ and $y$; 

$C_w(x,y) = \rho_w S_w(x) S_w(y)$ is the covariance of $x$ and $y$ among SSUs within the FSUs; 

$C(x,y) = C_b(x,y) + C_w(x,y)$. 


The between- and within-FSU correlation coefficients between $x$ and $y$ are respectively $\rho_b$ 
and $\rho_w$. 
4. ESTIMATORS FOR THE POPULATION TOTAL AND 
THEIR OPTIMUM VARIANCES 


4.1 Case (i) 


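The first estimator of the population total, $Y$, on the second occasion is used when information 
on the auxiliary variable is not available from the respondents but the FSU population total of the 
auxiliary variable is available for the selected FSUs. It is given as 

$$T_1 = \theta(1)\, T_m(1) + (1 - \theta(1))\, T_u(1) \tag{4.1}$$

$\theta(1)$ is a constant chosen so that the variance of $T_1$, $V(T_1)$, attains a minimum; while 

$$T_m(1) = \frac{1}{\lambda n} \sum_{i=1}^{\lambda n} \left\{ \frac{M_i \bar{y}_i}{P_i} - k(1) \left( \frac{Z_{2i}}{P_i} - Z_2 \right) \right\} - b(1) \left[ \frac{1}{\lambda n} \sum_{i=1}^{\lambda n} \left\{ \frac{M_i \bar{x}_i}{P_i} - k(1) \left( \frac{Z_{1i}}{P_i} - Z_1 \right) \right\} - \frac{1}{n} \sum_{i=1}^{n} \left\{ \frac{M_i \bar{x}_i}{P_i} - k(1) \left( \frac{Z_{1i}}{P_i} - Z_1 \right) \right\} \right]$$

is the difference estimator of $Y$ based on the matched sample; 

$$T_u(1) = \frac{1}{\mu n} \sum_{i=1}^{\mu n} \left\{ \frac{M_i \bar{y}_i}{P_i} - k(1) \left( \frac{Z_{2i}}{P_i} - Z_2 \right) \right\}$$

is the estimator for $Y$ based on the unmatched sample; and $k(1)$ and $b(1)$ are known 
constants. 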

For this estimator, it is assumed that the population total of the auxiliary variate, Z;, is 
available for each selected FSU on each occasion. The overall population total, Z, is also 
available on each occasion. No additional information on the auxiliary variate is obtained 
from the respondents (SSUs). 

Now by minimizing $V(T_1)$ with respect to $\theta(1)$ and solving, the optimum value of $\theta(1)$ 
becomes 

$$\theta_0(1) = \lambda A_2(1) / A(1)$$

where 

$$A_2(1) = S^2(y) + k^2(1) S_b^2(z_2) - 2k(1) C_b(z_2, y),$$

$$A(1) = A_2(1) + \mu^2 \{ b^2(1) A_1(1) - 2b(1) \beta(1) \}.$$

The optimum value of $k(1)$ is obtained by minimizing $V(T_u(1))$ with respect to $k(1)$. This 
gives $k_0(1) = C_b(z_2, y) / S_b^2(z_2)$. 

It can be shown that the optimum $V(T_1)$ for a given $\lambda$, following the method adopted 
by Jessen (1942), is 

$$V_0(T_1) = \frac{1}{n} \left[ A_2(1) + \mu \{ b^2(1) A_1(1) - 2b(1) \beta(1) \} \right] A_2(1) / A(1) \tag{4.2}$$

where 

$$A_1(1) = S^2(x) + k^2(1) S_b^2(z_1) - 2k(1) C_b(z_1, x),$$

$$\beta(1) = C(x,y) + k^2(1) C_b(z_1, z_2) - k(1) \{ C_b(z_1, y) + C_b(z_2, x) \},$$

$$A(1) = A_2(1) + \mu^2 \{ b^2(1) A_1(1) - 2b(1) \beta(1) \}.$$

Minimizing the variance of $T_m(1)$, the optimum $b(1)$ is 

$$b_0(1) = \beta(1) / A_1(1).$$

If $b_0(1)$ is substituted in (4.2), the optimum variance becomes 

$$V_0(T_1) = \frac{1}{n} \left[ \frac{A_1(1) A_2(1) - \mu \beta^2(1)}{A_1(1) A_2(1) - \mu^2 \beta^2(1)} \right] A_2(1). \tag{4.3}$$

By minimizing $V_0(T_1)$ in (4.2) with respect to $\mu$, the optimum matching fraction boils 
down to $\lambda_0 = 1 - \mu_0$ where 

$$\mu_0 = A_2(1) \left[ A_2(1) + \{ A_2^2(1) + A_2(1) ( b^2(1) A_1(1) - 2b(1) \beta(1) ) \}^{1/2} \right]^{-1}. \tag{4.4}$$

If $A_1(1) = A_2(1) = A(1)$, i.e. the population variability is the same on both occasions, the 
expression in (4.3) yields 

$$V_0(T_1) = \frac{1}{n} \left[ \frac{A^2(1) - \mu \beta^2(1)}{A^2(1) - \mu^2 \beta^2(1)} \right] A(1) \tag{4.5}$$

while the optimum unmatched fraction, $\mu_0$ (given in (4.4)), with $b_0(1)$ substituted for $b(1)$ 
becomes 

$$\mu_0 = A(1) \left[ A(1) + \{ A^2(1) - \beta^2(1) \}^{1/2} \right]^{-1}. \tag{4.6}$$

When $\mu_0$ is substituted in (4.5) the variance works out as 

$$V_0(T_1) = \frac{1}{2n} \left[ A(1) + \{ A^2(1) - \beta^2(1) \}^{1/2} \right]. \tag{4.7}$$
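
The step from (4.5) and (4.6) to (4.7) can be checked numerically. The sketch below assumes equal variability on the two occasions, so that a single quantity `A` stands for $A_1(1) = A_2(1)$ and `beta` for $\beta(1)$; the function names and the trial values are illustrative only.

```python
import math


def variance_given_mu(A, beta, mu, n):
    """Variance (4.5) for a given unmatched fraction mu."""
    return (A**2 - mu * beta**2) / (A**2 - mu**2 * beta**2) * A / n


def optimum_mu(A, beta):
    """Optimum unmatched fraction (4.6)."""
    return A / (A + math.sqrt(A**2 - beta**2))


def optimum_variance(A, beta, n):
    """Variance (4.7), i.e. (4.5) evaluated at the optimum of (4.6)."""
    return (A + math.sqrt(A**2 - beta**2)) / (2 * n)


# Illustrative values: A = 1, beta = 0.8, n = 10
mu0 = optimum_mu(1.0, 0.8)
v_at_mu0 = variance_given_mu(1.0, 0.8, mu0, 10)
v_closed_form = optimum_variance(1.0, 0.8, 10)
```

For these values the optimum is to replace 62.5% of the FSUs, and any other choice of $\mu$ in (4.5) gives a variance at least as large as (4.7).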
4.2 Case (ii) 


The second estimator is the usual one in which information is obtained on both the main 
and auxiliary characteristic from the reporting units and the population total of the 
auxiliary characteristic is known. 

It is written as 

$$T_2 = \theta(2)\, T_m(2) + (1 - \theta(2))\, T_u(2) \tag{4.8}$$

where 

$$T_m(2) = \frac{1}{\lambda n} \sum_{i=1}^{\lambda n} \left\{ \frac{M_i \bar{y}_i}{P_i} - k(2) \left( \frac{M_i \bar{z}_{2i}}{P_i} - Z_2 \right) \right\} - b(2) \left[ \frac{1}{\lambda n} \sum_{i=1}^{\lambda n} \left\{ \frac{M_i \bar{x}_i}{P_i} - k(2) \left( \frac{M_i \bar{z}_{1i}}{P_i} - Z_1 \right) \right\} - \frac{1}{n} \sum_{i=1}^{n} \left\{ \frac{M_i \bar{x}_i}{P_i} - k(2) \left( \frac{M_i \bar{z}_{1i}}{P_i} - Z_1 \right) \right\} \right]$$

and 

$$T_u(2) = \frac{1}{\mu n} \sum_{i=1}^{\mu n} \left\{ \frac{M_i \bar{y}_i}{P_i} - k(2) \left( \frac{M_i \bar{z}_{2i}}{P_i} - Z_2 \right) \right\}.$$

Here the overall population total of the auxiliary variate is known on both occasions. 
In addition, information on the auxiliary variate, $z_{hij}$, is obtained for every SSU in the 
sample. This is the usual way of using the auxiliary information in a two-stage design described 
in the literature. It can be shown that the optimum variance of $T_2$ is 

$$V_0(T_2) = \frac{1}{n} \left[ A_2(2) + \mu \{ b^2(2) A_1(2) - 2b(2) \beta(2) \} \right] A_2(2) / A(2) \tag{4.9}$$

and the optimum weight is 

$$\theta_0(2) = \lambda A_2(2) / A(2)$$

where 

$$A_2(2) = S^2(y) + k^2(2) S^2(z_2) - 2k(2) C(z_2, y),$$

$$A_1(2) = S^2(x) + k^2(2) S^2(z_1) - 2k(2) C(z_1, x),$$

$$\beta(2) = C(x,y) + k^2(2) C(z_1, z_2) - k(2) \{ C(z_1, y) + C(x, z_2) \},$$

$$A(2) = A_2(2) + \mu^2 \{ b^2(2) A_1(2) - 2b(2) \beta(2) \}.$$

The optimum value of $k(2)$ is $k_0(2) = C(z_2, y) / S^2(z_2)$. 

By substituting the optimum regression coefficient $b_0(2) = \beta(2)/A_1(2)$, obtained by 
minimizing the variance of $T_m(2)$, in (4.9) and assuming that $A_1(2) = A_2(2) = A(2)$ we 
have 

$$V_0(T_2) = \frac{1}{n} \left[ \frac{A^2(2) - \mu \beta^2(2)}{A^2(2) - \mu^2 \beta^2(2)} \right] A(2). \tag{4.10}$$

If the optimum $\mu$ is substituted in (4.10), the variance becomes 

$$V_0(T_2) = \frac{1}{2n} \left[ A(2) + \{ A^2(2) - \beta^2(2) \}^{1/2} \right]. \tag{4.11}$$


4.3 Case (iii) 


The third way of utilising available auxiliary information to improve the estimate of the 
current population total, $Y$, under the given sampling scheme is similar to the second. The 
only difference is that the population total of the auxiliary characteristic is not known; 
however, its FSU population mean is known for the selected FSUs. 

This is given as 

$$T_3 = \theta(3)\, T_m(3) + (1 - \theta(3))\, T_u(3) \tag{4.12}$$

where 

$$T_m(3) = \frac{1}{\lambda n} \sum_{i=1}^{\lambda n} \frac{M_i}{P_i} \left\{ \bar{y}_i - k(3) ( \bar{z}_{2i} - \bar{Z}_{2i} ) \right\} - b(3) \left[ \frac{1}{\lambda n} \sum_{i=1}^{\lambda n} \frac{M_i}{P_i} \left\{ \bar{x}_i - k(3) ( \bar{z}_{1i} - \bar{Z}_{1i} ) \right\} - \frac{1}{n} \sum_{i=1}^{n} \frac{M_i}{P_i} \left\{ \bar{x}_i - k(3) ( \bar{z}_{1i} - \bar{Z}_{1i} ) \right\} \right]$$

and 

$$T_u(3) = \frac{1}{\mu n} \sum_{i=1}^{\mu n} \frac{M_i}{P_i} \left\{ \bar{y}_i - k(3) ( \bar{z}_{2i} - \bar{Z}_{2i} ) \right\}.$$

For this estimator, we suppose that the values of both the main variate and the auxiliary 
variate are obtained for every SSU in the sample on both occasions. We also assume that 
the population mean, $\bar{Z}_{hi}$, of the auxiliary variate is known for the selected FSUs. 

The optimum variance of $T_3$ for a given $\lambda$ is 

$$V_0(T_3) = \frac{1}{n} \left[ A_2(3) + \mu \{ b^2(3) A_1(3) - 2b(3) \beta(3) \} \right] A_2(3) / A(3) \tag{4.13}$$

while the optimum weight is as usual obtained as 

$$\theta_0(3) = \lambda A_2(3) / A(3),$$

where 

$$A_2(3) = S^2(y) + k^2(3) S_w^2(z_2) - 2k(3) C_w(z_2, y),$$

$$A_1(3) = S^2(x) + k^2(3) S_w^2(z_1) - 2k(3) C_w(z_1, x),$$

$$\beta(3) = C(x,y) + k^2(3) C_w(z_1, z_2) - k(3) \{ C_w(z_1, y) + C_w(x, z_2) \},$$

$$A(3) = A_2(3) + \mu^2 \{ b^2(3) A_1(3) - 2b(3) \beta(3) \}.$$

The optimum value of $k(3)$ is $k_0(3) = C_w(z_2, y) / S_w^2(z_2)$. 

If the optimum regression coefficient is substituted in (4.13), and it is assumed that 
population variances are the same on both occasions, then (4.13) works out as 

$$V_0(T_3) = \frac{1}{n} \left[ \frac{A^2(3) - \mu \beta^2(3)}{A^2(3) - \mu^2 \beta^2(3)} \right] A(3). \tag{4.14}$$

When the optimum $\mu$ is substituted in (4.14), the variance is 

$$V_0(T_3) = \frac{1}{2n} \left[ A(3) + \{ A^2(3) - \beta^2(3) \}^{1/2} \right]. \tag{4.15}$$
4.4 Efficiency of the Proposed Estimators 


The variances given in (4.7), (4.11) and (4.15) will be used to compare the efficiencies 
of $T_1$, $T_2$ and $T_3$ with respect to 

$$T_0 = \frac{1}{n} \sum_{i=1}^{n} \frac{M_i \bar{y}_i}{P_i} .$$

$T_0$ is the estimator for $Y$ when there is no partial matching of units and no auxiliary 
information is used. In addition, the efficiency of the usual partial matching estimator $T'$, 
which uses no auxiliary information, relative to $T_0$ will be presented to assist in 
understanding the performance of the proposed estimators. 

The usual partial matching estimator is defined as 

$$T' = \theta'\, T'_m + (1 - \theta')\, T'_u \tag{4.16}$$

where 

$$T'_m = \frac{1}{\lambda n} \sum_{i=1}^{\lambda n} \frac{M_i \bar{y}_i}{P_i} - b' \left[ \frac{1}{\lambda n} \sum_{i=1}^{\lambda n} \frac{M_i \bar{x}_i}{P_i} - \frac{1}{n} \sum_{i=1}^{n} \frac{M_i \bar{x}_i}{P_i} \right]$$

and 

$$T'_u = \frac{1}{\mu n} \sum_{i=1}^{\mu n} \frac{M_i \bar{y}_i}{P_i} .$$

The optimum variance of $T'$, obtained using the optimum value of $b'$, 
$b'_0 = C(x,y)/S^2(x)$, and assuming $S^2(y) = S^2(x)$, is 

$$V_0(T') = \frac{1}{n} \left[ \frac{S^4(y) - \mu C^2(x,y)}{S^4(y) - \mu^2 C^2(x,y)} \right] S^2(y). \tag{4.17}$$

Substituting the optimum value of $\mu$ in (4.17), the variance of $T'$ becomes 

$$V_0(T') = \frac{1}{2n} \left[ S^2(y) + \{ S^4(y) - C^2(x,y) \}^{1/2} \right]. \tag{4.18}$$

To calculate the efficiencies, the following assumptions about the correlation coefficients 
and the constant $k$ were made: 

$$\rho_b(x, z_2) = \rho_b(z_1, y) = \rho_b(z_1, z_2) = \rho_b ;$$

$$\rho_w(x, z_2) = \rho_w(z_1, y) = \rho_w(z_1, z_2) = \rho_w ;$$

$$k(1) = k(2) = k(3) = 1.$$

The efficiencies have been presented for only the positive values of $\rho_b$ and $\rho_w$, and a set 
of values of 

$$\delta = S_w^2(y)/S_b^2(y), \quad R_b = S_b^2(z)/S_b^2(y) \quad \text{and} \quad R_w = S_w^2(z)/S_b^2(y).$$


Looking at Table 2, we observe that none of the strategies $T_1$, $T_2$ and $T_3$ (sampling design 
and estimator) is uniformly more efficient than strategy $T_0$. The contrary is true of $T'$, 
which is always more efficient than $T_0$; at worst, its gain over $T_0$ is small (see Table 1). 

The results in Tables 1 and 2 show $T_1$ is to be preferred to $T'$ only when $R_b = 0.05$; and 
when $\rho_b = 0.8$ and $R_b = 0.5$. 
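
The efficiencies discussed here can be recomputed from (4.7), (4.11), (4.15) and (4.18). The sketch below is an illustration under explicit assumptions, not the author's program: $S_b^2(y)$ is normalised to 1; variances are the same on both occasions; every between-FSU (within-FSU) correlation, including that between the auxiliary and study variates on the same occasion, equals $\rho_b$ ($\rho_w$); $k = 1$; and both $R_b$ and $R_w$ are measured relative to $S_b^2(y)$. The function names are hypothetical. Under these assumptions the sketch returns 1.04 for $T_1$ and 0.71 for $T_3$ at $\rho_b = \rho_w = 0.2$, $\delta = 0.05$, values that appear among the legible entries of Table 2.

```python
import math


def _optimum_variance(A, beta):
    """Common form of (4.7), (4.11), (4.15) and (4.18), with the factor 1/n dropped."""
    return 0.5 * (A + math.sqrt(A * A - beta * beta))


def efficiency(strategy, rho_b, rho_w, delta, R_b=0.0, R_w=0.0):
    """Efficiency V(T_0) / V_0(T) of T', T1, T2 or T3 relative to T_0."""
    s2y = 1.0 + delta                      # S^2(y) = S_b^2(y) + S_w^2(y)
    cxy = rho_b + rho_w * delta            # C(x, y)
    czy_b = rho_b * math.sqrt(R_b)         # between-FSU covariance of z with y (or x)
    czy_w = rho_w * math.sqrt(R_w * delta) # within-FSU covariance of z with y (or x)
    if strategy == "T'":
        A, beta = s2y, cxy
    elif strategy == "T1":                 # uses between-FSU information on z only
        A = s2y + R_b - 2 * czy_b
        beta = cxy + rho_b * R_b - 2 * czy_b
    elif strategy == "T2":                 # uses both components
        A = s2y + R_b + R_w - 2 * (czy_b + czy_w)
        beta = cxy + rho_b * R_b + rho_w * R_w - 2 * (czy_b + czy_w)
    elif strategy == "T3":                 # uses within-FSU information on z only
        A = s2y + R_w - 2 * czy_w
        beta = cxy + rho_w * R_w - 2 * czy_w
    else:
        raise ValueError("strategy must be one of T', T1, T2, T3")
    return s2y / _optimum_variance(A, beta)


# Largest efficiency of T1 quoted in the text (a gain of 155% over T_0)
e1_max = efficiency("T1", 0.8, 0.8, 0.05, R_b=0.5)
```

Because $T_1$ depends only on $R_b$ and $T_3$ only on $R_w$ in this parametrisation, the corresponding rows of the efficiency table repeat across the other ratio.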
Table 1 
The Efficiency of $T'$ with Respect to $T_0$ 


$\rho_b$     $\delta$     $\rho_w = 0.2$     $\rho_w = 0.8$ 

0.2          0.05         1.01               1.01 
             0.5          1.01               1.04 
             5.0          1.01               [illegible] 

0.8          0.05         [illegible]        1.25 
             0.5          [illegible]        1.25 
             5.0          1.02               [illegible] 


$T_2$ is better than $T'$ when: 
(i) 0), = 0:2 eRe 05: 
(il)t 076 — ps) 0. 8a =, e005 0s 


(iii) $\delta = 0.5, 5.0$, $R_b = R_w = 0.05, 0.5$, $\rho_b = 0.2$ and $\rho_w = 0.8$. 


$T_3$ is generally more efficient than $T'$ when: 


(i) 6 


eri y mil ate 


(ii) $\delta = 0.5$, $\rho_w = 0.8$ and $R_w = 0.05, 0.5$. 


The maximum gain in efficiency of $T'$ over $T_0$ is 25% (see Table 1). In Table 2, the 
maximum gain of $T_1$ over $T_0$ is 155%, which occurs when $\rho_b = \rho_w = 0.8$, $\delta = 0.05$, $R_b = 0.5$. 
The maximum gain in efficiency of $T_2$ over $T_0$ is 172%; this happens when $\rho_b = \rho_w = 
0.8$, $\delta = R_w = 0.05$. We also observe that when $\rho_b = \rho_w = 0.8$, $\delta = R_w = 5.0$, the 
maximum gain of $T_3$ over $T_0$ is 104%. It is therefore evident that the use of an auxiliary variate 
can greatly improve the efficiency of partial matching of units. 

If we now take the three strategies $T_1$, $T_2$ and $T_3$, and compare them among themselves, 
we conclude that none of the strategies is uniformly better than the others, even though the 
maximum gain in efficiency of $T_2$ over $T_0$ is higher than that of $T_1$, which in turn is higher 
than the maximum gain of $T_3$. In general $T_1$ is superior to $T_2$ when $\rho_w = 0.2$, while $T_2$ is 
better than $T_1$ when $\rho_w = 0.8$. $T_1$ is preferred to $T_3$ when $\rho_b = 0.8$, $\rho_w = 0.2$ and 
$R_b = 0.05, 0.5$, and also when $\rho_b = \rho_w = 0.8$ and $\delta = R_b = 0.05$. Finally $T_3$ is better than 
$T_1$ when $\rho_w = 0.8$, $R_b = 5.0$, and when $\rho_b = \rho_w = 0.2$ with $R_b = 0.5, 5.0$. 


5. APPLICATION 


The proposed estimators were applied to a survey of teak plantations. The aim was to 
estimate the average height of teak trees using the girth as the auxiliary information. 
Table 2 
The Efficiency of $T_1$, $T_2$ and $T_3$ with Respect to $T_0$ 


           $R_b = 0.05$            $R_b = 0.5$             $R_b = 5.0$ 
$R_w$:     0.05   0.5    5.0       0.05   0.5    5.0       0.05   0.5    5.0           Strategy 

           1.04   1.04   1.04      0.83   0.83   0.83      0.20   0.20   [illegible]   $T_1$ 
           1.01   0.73   0.18      0.81   0.62   0.17      0.20   [illegible]   [illegible]   $T_2$ 
           0.98   0.71   0.18      0.98   0.71   0.18      0.98   0.71   [illegible]   $T_3$ 

[The remaining rows of the table, and the values of $\rho_b$, $\rho_w$ and $\delta$ that label each 
block of rows, are illegible in the source.] 



Table 3 


Estimated Efficiency of the Proposed Estimators with Respect 
to $T_0$ in the Estimation of the Average Height of Teak Trees 


Estimator                   Mean height    Variance    Estimated 
                            (m)            (m²)        % Efficiency 

$T_0$ (no matching)         20.04          6.3118         100 
$T'$ (partial matching)     18.06          4.0680         155 
$T_1$                       17.86          0.0718        8791 
$T_2$                       17.31          0.0651        9635 
$T_3$                       17.99          4.0183         157 


The teak trees used in this study were planted in 1965 with different spacings, producing 
plantations with the following number of trees per hectare: 2,000, 800, 400 and 250 trees. 
To measure the trees, an area of 40 metres by 40 metres was mapped out in each plantation 
after a sample of 8 plantations (FSUs) had been selected from 16 plantations, using the 
PPSWR scheme. The number of trees in each plantation was used as a measure of size. All 
the trees in the 40m by 40m area constituted the second stage units and the girth of each 
tree at breast height was measured. For the height measurements, a subsample of the trees 
was selected from the 40m by 40m area in each selected FSU. The first measurements were 
carried out in 1981 and the second in 1983. The sampling scheme used was the same as the 
one described in Section 2, with 50% matching of the FSUs. 

The estimated efficiencies are given in Table 3. The sample estimates of the variance and 
covariance terms were used to obtain the optimum variances of $T'$, $T_1$, $T_2$ and $T_3$ because 
the population values of these variances and covariances were not known. Therefore, the 
low values of the estimated optimum variances of $T_1$ and $T_2$ can be attributed partly to the 
nature of the sample data and partly to the nature of the estimators. 

We observe that the estimator $T_2$ is more efficient than either $T_1$ or $T_3$, while $T_1$ is more 
efficient than $T_3$ in the estimation of the average height of teak trees using the girth as the 
auxiliary information. 
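
The percentage efficiencies in Table 3 are the ratio of the estimated variance of $T_0$ to that of each estimator, times 100. A minimal sketch, using the variances printed in the table (the helper name `percent_efficiency` is hypothetical):

```python
def percent_efficiency(variance, baseline_variance):
    """Estimated efficiency, in per cent, relative to the baseline estimator."""
    return 100.0 * baseline_variance / variance


# Estimated variances (m squared) as printed in Table 3
VAR_T0 = 6.3118
eff_partial = percent_efficiency(4.0680, VAR_T0)  # T', partial matching only
eff_t1 = percent_efficiency(0.0718, VAR_T0)       # T_1
eff_t3 = percent_efficiency(4.0183, VAR_T0)       # T_3
```

Since the printed variances are rounded to four decimals, a recomputed efficiency need not match a printed entry to the last digit.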
Corrigendum 


‘Some Optimality Results in the Presence of Nonresponse’ by V.P. Godambe and M.E. 
Thompson, Survey Methodology (1986), 12, 29-36. 


Formula (2.6), the definition of the optimum estimating function in A” (p, q), should be 


$$h^{*} = \sum_{i \in s'} (y_i - \theta x_i)\, a_i / (\pi_i q_i).$$

GUIDELINES FOR MANUSCRIPTS 


Before having a manuscript typed for submission, please examine a recent issue (Vol. 10, 
No. 2 and onward) of Survey Methodology as a guide and note particularly the following 
points: 

1. Layout 
1.1 Manuscripts should be typed on white bond paper of standard size (8½ x 11 inch), 
    one side only, entirely double spaced with margins of at least 1½ inches on all sides. 
1.2 The manuscripts should be divided into numbered sections with suitable verbal titles. 
1.3 The name and address of each author should be given as a footnote on the first page 
    of the manuscript. 
1.4 Acknowledgements should appear at the end of the text. 
1.5 Any appendix should be placed after the acknowledgements but before the list of 
    references. 

2. Abstract 
    The manuscript should begin with an abstract consisting of one paragraph followed 
    by three to six key words. Avoid mathematical expressions in the abstract. 

3. Style 
3.1 Avoid footnotes, abbreviations, and acronyms. 
3.2 Mathematical symbols will be italicized unless specified otherwise except for functional 
    symbols such as ‘‘exp(·)’’ and ‘‘log(·)’’, etc. 
3.3 Short formulae should be left in the text and everything in the text should fit in single 
    spacing. Long and important equations should be separated from the text and numbered 
    consecutively with arabic numerals on the right if they are to be referred to later. 
3.4 Write fractions in the text using a solidus. 
3.5 Distinguish between ambiguous characters, (e.g., w, ω; o, O, 0; l, 1). 
3.6 Italics are used for emphasis. Indicate italics by underlining on the manuscript. 

4. Figures and Tables 
4.1 All figures and tables should be numbered consecutively with arabic numerals, with 
    titles which are as nearly self explanatory as possible, at the bottom for figures and 
    at the top for tables. 
4.2 They should be put on separate pages with an indication of their appropriate 
    placement in the text. (Normally they should appear near where they are first referred to). 

5. References 
5.1 References in the text should be cited with authors’ names and the date of publication. 
    If part of a reference is cited, indicate after the reference, e.g., Cochran (1977, p. 164). 
5.2 The list of references at the end of the manuscript should be arranged alphabetically 
    and for the same author chronologically. Distinguish publications of the same author 
    in the same year by attaching a, b, c to the year of publication. Journal titles should 
    not be abbreviated. Follow the same format used in recent issues. 
In This Issue 


Two new features appear for the first time in this issue of Survey Methodology. ‘‘In This 
Issue’’ summarizes papers appearing in the Journal and will appear regularly. The other new 
feature, a ‘‘Short Communications’’ section, will appear in the Journal from time to time. 


This issue contains nine papers, four dealing with estimation and weighting methods, 
including two on family estimation. Fritz Scheuren’s initiative and editorial assistance were 
instrumental in putting this special section together. 


The first three papers in the special section deal (at least in part) with least-squares methods 
for weighting survey data. There is a certain historical irony in this. In their 1940 paper, 
Deming and Stephan introduced iterative proportional fitting as a quick practical way for 
approximating the estimates obtained by minimizing a squared function of the cells of a contingency
table, subject to restrictions on the margins. The use of this technique has become
fairly generalized in weighting survey data, where it is known as "raking ratio estimation".


In "An Alternative Method of Controlling Current Population Survey Estimates to Population
Counts", Copeland, Peitzmeier and Hoy compare a raking ratio estimator to a generalized
least-squares estimator under the same marginal restrictions. The comparison is carried
out for estimates of individual characteristics obtained from the Current Population Survey, 
a household survey conducted by the United States Bureau of the Census. They note that 
the estimates produced by the two methods are very similar. 


Most current methods of weighting data from household surveys produce weights that 
differ from person to person within the same household. A single weight per household, 
in addition to its conceptual appeal, would eliminate the recurrent and often awkward 
discrepancies between person-based and family-based estimates. Alexander, in "A Class
of Methods for Using Person Controls in Household Weighting", considers a class of
"constrained minimum distance" methods (including GLS) which actually yield a single weight
per household yet respect person-level marginal totals. The properties of these methods in 
the presence of undercoverage are then studied through some simple coverage models. 


Lemaitre and Dufour, in "An Integrated Method for Weighting Persons and Families",
propose a regression estimator that also yields a single weight per household and is equivalent 
to the GLS estimator under certain general conditions. Using Canadian Labour Force Survey 
data, they obtain large efficiency gains for estimates of families, and marginal gains for 
estimates of persons, relative to current methods. 


In the last paper in this section, "Modified Raking Ratio Estimation", Oh and Scheuren
describe an estimation procedure similar to the usual raking ratio. Their method can be used 
when population totals are available not only for the margins, but also for interior cells in 
a multi-way table. It combines conventional ratio estimation for cells with large sample sizes 
and raking ratio estimation for cells with sample sizes that are small (or zero). In an appli- 
cation involving sampling of corporate income tax returns, the Oh-Scheuren approach 
produced more efficient estimates relative to conventional ratio estimation. The authors stress 
that, before their method is offered for wide use, further work is needed including, among 
other things, comparison with conventional collapsing schemes. 


The other four papers in this issue consider the development and application of methods 
and procedures with regard to probabilities of response in a survey context, rounding criteria 
for protection of confidentiality, data collection and analysis for retrospective type surveys, 
and variance estimation for the Canadian Labour Force Survey. 




Every survey has some nonresponse problems. These are usually handled by imputation
or adjustment procedures based on the assumption that nonresponse occurs at random within
imputation or adjustment classes. The resulting estimates are generally biased whenever this
assumption is not satisfied. Various methods of estimating response probabilities involving
models have been proposed, notably by Cassel, Sarndal and Wretman (CSW), but these
methods are not effective when the assumed model is inadequate. In "Nonparametric Methods
for Estimating Individual Response Probabilities", Giommi describes nonparametric procedures
for estimating response probabilities using auxiliary information, providing an alternative
to the CSW estimator that is robust against both population and response model
breakdown. The resulting estimators perform well in Monte Carlo simulation studies.


Random rounding is used to ensure the confidentiality of information about individuals
in statistical aggregates. In the context of the 1971 Canadian Census, Nargundkar and
Saveland developed a rounding process that is unbiased in the sense that the expected value
of the rounded data is the same as that of the unrounded data. Fellegi (SMJ, 1975) introduced
controlled random rounding, a procedure that, in addition to being unbiased, also
preserves additivity. Several other papers have since appeared, including the very recent
work of Cox (JASA, 1987), generalizing and extending the applications to other fields. In
"Estimates Based on Randomly Rounded Data", Withers develops an expression for the
variance of unbiased estimates of cell probabilities and presents a comparison of efficiencies
involving the rounding processes used in Australia, the United Kingdom, New Zealand and
Canada. He also extends his results to any smooth function of the cell probabilities for
applications in different areas of statistics.


In "Variance Estimation for the Canadian Labour Force Survey", Choudhry and Lee
describe studies conducted to select a variance estimator for raking ratio estimates from the
Canadian Labour Force Survey. Their paper reports on a comparison of three variance
estimators for the random group sampling design: Keyfitz, Rao-Hartley-Cochran and Rao.
In spite of its slight inferiority to the other two methods in terms of bias and stability, the
Keyfitz method is suggested for actual use because of its operational simplicity.


In "The 'AGEVEN' Record: A Tool for the Collection of Retrospective Data", Antoine,
Bry and Diouf describe techniques used to collect data on natality and mortality of women
in Pikine, a suburb of Dakar, Senegal. The retrospective procedure employed involved placing
observed events (mainly births and deaths) in their socio-economic context and, according
to the authors, made it possible to "better assess the relationship between urban insertion
and changes in demographic behaviour". Analysis of data from the survey clearly indicates
that child mortality rates are higher for children born in rural villages than for those born
in Pikine.

It is well known that the Hansen-Hurwitz strategy is inferior to the Horvitz-Thompson
strategy associated with a number of IPPS (inclusion probability proportional to size) sampling
procedures. In the final piece in this issue, in the "Short Communications" section,
Prabhu-Ajgaonkar presents proofs of these results that are much simpler than those already available
in the literature.


The Editor 
Nonparametric Methods for Estimating Individual
Response Probabilities


ANDREA GIOMMI¹


ABSTRACT 


This paper deals with the nonresponse problem in the estimation of the mean of a finite population, 
following an approach closely related to that of Cassel, Sarndal and Wretman (1983). Two very simple 
methods are proposed for estimating the individual response probabilities; these are then used, in con- 
nection with a superpopulation model, to construct estimators for the population mean. A first evaluation 
of the properties of the proposed methods is given by a Monte Carlo experiment. The results shed 
some light on their effectiveness. 


KEY WORDS: Nonresponse; Individual response probability; Nonparametric methods. 


1. INTRODUCTION 


Dealing with the estimation of finite population mean (or total, etc.) in the presence of 
nonresponse, Cassel, Sarndal and Wretman (1983) introduced a very general estimation 
method based on the fundamental concept of individual response probability (IRP). The 
authors proposed estimators which are in part determined by a superpopulation model and 
in part by a response model, i.e., a model formalizing the response mechanism and by which 
IRP can be estimated from sample data. The estimation of IRP is the crucial point of their 
theory. In fact, if the superpopulation model is not correctly chosen, as is often the case, 
only a correct choice of the response model may guard the estimators from design bias. By 
a Monte Carlo experiment, Giommi (1985a) showed that a response model supplying a ‘‘good 
approximation’’ of the ‘‘true’’ response model can restore virtual unbiasedness; but little 
is known about the extent of a good approximation and in any case the choice of a response 
model may prove cumbersome besides being arbitrary. A natural way of avoiding these dif- 
ficulties is to estimate the IRP by nonparametric procedures. In the present paper we pro- 
pose two very simple methods to estimate IRP when available auxiliary information (which 
is assumed to be related to the response behaviour) is represented by a single continuous 
variable. The methods, which make use of some tools of kernel estimation theory, may
be viewed as an extension of the popular correction technique for nonresponse that consists
in reweighting units by adjustment cells.

In this paper some empirical evaluations of these methods are described and the results 
regarding the bias and efficiency of the related estimators are presented. 


2. ESTIMATION OF THE INDIVIDUAL RESPONSE PROBABILITIES 
Let us consider a population of N units labelled k (k = 1, 2, ..., N), and let Y be a variable
under study, of which we want to estimate the mean $\bar{Y} = \sum_{k=1}^{N} y_k / N$ from a sample s of n units,
the selection being based on a given design p(s). For the estimation, auxiliary information
is available, represented by known values $x_k$ ($k = 1, 2, \ldots, N$) of a scalar continuous


¹ Andrea Giommi, Department of Statistics, University of Florence, Via Curtatone, 1, 50123 Florence, Italy.
variable X (the extension of the procedures proposed for the multidimensional case is, in
principle, straightforward).

In the sample, Y is observable only on a subset r of $n_r$ respondents and not on the $n - n_r$
nonrespondents. After the selection of the sample, the available information can be represented
as follows:


$$(x_k,\; I_k,\; I_k y_k), \qquad k \in s,$$


where $I_k$ is an indicator random variable such that $E(I_k) = q_k$ and $q_k$ is the IRP.
To estimate $q_k$, a parametric model is generally assumed (Cassel et al. 1983) such that:


$$q_k = q(\theta, x_k),$$


where $\theta$ is an unknown parameter (or vector of parameters) and $q(\cdot,\cdot)$ is a functional form
to be specified. Estimates of $q_k$ are then obtained by replacing $\theta$ in the above parametric model
with its estimated value $\hat\theta$.

In this paper the estimates of $q_k$ ($k \in r$) are obtained by avoiding any parametric
specification of the function $q(\cdot,\cdot)$, while maintaining the hypothesis that the IRPs
depend on the values $x_k$. Two procedures (methods (1) and (2)) are proposed.

In the first, $q_k$ ($k \in r$) is estimated as the response rate (i.e. the proportion of
respondents) in a group of units centered on the unit k, corresponding to an appropriate
interval of x-values centered at $x_k$. Assuming that $2h_k$ is the length of such an interval, $q_k$
is estimated by the following ratio:


$$\hat q_k = \sum_{j \in r} D(x_k - x_j) \Big/ \sum_{j \in s} D(x_k - x_j), \tag{1}$$


where


$$D(x_k - x_j) = \begin{cases} 1 & \text{if } |x_k - x_j| \le h_k, \\ 0 & \text{otherwise.} \end{cases}$$


It is evident that the estimate $\hat q_k$ depends on $h_k$, or on h if we adopt (as in this paper) a constant
interval; the numerical specification of h is a main problem in applications.
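
Method (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's program: the function name `irp_window` and its arguments are hypothetical, a constant half-width h is used, and the estimate is returned for every sample unit even though the paper needs it only for respondents.

```python
import numpy as np

def irp_window(x, responded, h):
    """Method (1): response rate among sample units j with |x_k - x_j| <= h."""
    x = np.asarray(x, dtype=float)
    responded = np.asarray(responded, dtype=bool)
    # D[k, j] = 1 when unit j falls in the interval of half-width h centered at x_k
    D = (np.abs(x[:, None] - x[None, :]) <= h).astype(float)
    # numerator sums over respondents (j in r), denominator over the whole sample (j in s)
    return D[:, responded].sum(axis=1) / D.sum(axis=1)
```

Because every unit lies in its own interval, the denominator is never zero. When h covers the whole range of the x-values, every estimate equals the overall response rate $n_r/n$.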

In the second procedure, all the sample units, rather than a group, contribute to the estimation
of $q_k$. By this method the possible limitation due to the classification of responding units
in groups is removed. In other words, one might consider overly restrictive the fact that in
the estimation of $q_k$ some units contribute with weight 1 and some others with weight 0.
With method (2), the estimate is given by:


$$\hat q_k = \sum_{j \in r} D^*(x_k - x_j) \Big/ \sum_{j \in s} D^*(x_k - x_j), \tag{2}$$


where $D^*$ has to be specified. In this case, each value $x_j$ contributes towards the estimate
$\hat q_k$ through $D^*$, an amount inversely related to the difference $|x_k - x_j|$.


In (2), the problem is twofold: i) to specify the functional form $D^*$ and ii) to define the
values of its parameters. In this paper we adopt a function $D^*$ of the normal type:


$$D^*(z) = (2\pi h^2)^{-1/2} \exp(-z^2 / 2h^2), \qquad z = x_k - x_j, \tag{3}$$


in which the standard deviation, indicated by h, plays a role analogous to that of the parameter
h in the expression (1). In both (1) and (2), when h increases, $\hat q_k$ approaches the constant
value $n_r/n$. In (1), it reaches $n_r/n$ when h covers the whole range of the x-values.

An empirical study was designed to evaluate the properties of the proposed procedures,
using a very wide range of h values. In the present paper we have limited ourselves to reporting
results for only three (constant) values of h, equal to 1/10, 3/10 and 5/10 of the range
of the x-sample values. Finally, we must observe that both expressions (1) and (2), apart
from a normalizing factor, show themselves as the ratio of two probability density kernel
estimators (in the approach of Rosenblatt (1956)) over different sets of x-values. Therefore,
as suggested by Giommi (1985b), the value of h may be selected considering proposals put
forward in that theory.
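
A comparable sketch for method (2) with the normal-type weight is given below. The function name `irp_normal_kernel` is hypothetical. The normalizing constant of the normal density is omitted because it cancels in the ratio, in line with the remark that both expressions are ratios of kernel estimators apart from a normalizing factor.

```python
import numpy as np

def irp_normal_kernel(x, responded, h):
    """Method (2): every sample unit contributes with a normal-type weight."""
    x = np.asarray(x, dtype=float)
    responded = np.asarray(responded, dtype=bool)
    z = x[:, None] - x[None, :]
    # unnormalized normal kernel; the constant factor cancels in the ratio
    W = np.exp(-z ** 2 / (2.0 * h ** 2))
    return W[:, responded].sum(axis=1) / W.sum(axis=1)
```

With a very large h all weights are close to 1 and the estimates approach $n_r/n$. With h close to 0 each unit is effectively weighted only by itself, so the estimates tend to 1 for respondents and 0 for nonrespondents.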


3. SUPERPOPULATION MODEL AND ESTIMATORS 


For the choice of the estimator of $\bar{Y}$, we assume a superpopulation model $\xi$ in which the
population values $y_k$, k = 1, 2, ..., N, are considered to be a random sample such that:


$$E_\xi(Y_k) = \mu_k = \beta x_k,$$
$$\operatorname{Var}_\xi(Y_k) = \sigma_k^2 = \sigma^2 x_k, \tag{4}$$


where $\beta$ and $\sigma^2$ are unknown and $x_k$ is the known value of the auxiliary variable X. It is apparent
that the superpopulation model employed here is mainly applicable to quantitative rather
than qualitative variables; other models should be employed in such cases. We further limit
ourselves to the consideration of simple random samples. Provided the variance of Y may
be specified as in (4), Cassel et al. (1983) have shown that the following estimator:


$$T = \bar{x} \Big( \sum_r y_k / q_k \Big) \Big/ \Big( \sum_r x_k / q_k \Big),$$


where $\sum_r$ indicates the sum over the set r and $\bar{x} = \sum_{k=1}^{N} x_k / N$, is approximately unbiased,
thanks to the $q_k$ correction, even if the first equation in (4) fails to specify the true relationship
between X and Y. This may happen, for example, when the "true" model has an intercept
or has two regression coefficients (see (5) below), etc.

Unfortunately, in practice the estimator T cannot be used since $q_k$ is unknown. The problem
is, therefore, to evaluate its properties when $q_k$ is replaced by its estimate derived either
from method (1) or (2).

We shall examine such estimators for the three chosen values of h. We denote the
estimators by $TD_i$ and $TD_i^*$ where i = 1, 3, 5 as in Table 1.


Table 1


Definition of Estimators


                  Estimators
 h        Method (1)     Method (2)
0.1       TD_1           TD*_1
0.3       TD_3           TD*_3
0.5       TD_5           TD*_5
In addition, the following estimators are also considered in the Monte Carlo study:


$$TC = \bar{x} \Big( \sum_s y_k \Big/ \sum_s x_k \Big) \quad \text{and} \quad TI = \bar{x} \Big( \sum_r y_k \Big/ \sum_r x_k \Big).$$


TC is the full sample estimator, that is, the ratio estimator under the hypothesis of complete
response, and TI is the same estimator based on the set of respondents, on which no $q_k$
correction is made for nonresponse. Note that TI is also an estimator derived from a well
known procedure of imputation (by regression) of missing values (Cassel et al. 1983) and
equals TD when h covers the whole range of the x-values. TI is approximately unbiased only
if (4) is true. The bias, as we shall see, depends on the divergence between the conditions
in (4) and those of the population under study. As in the experiment of the next section model
(4) will be a "false" model (that is, the study populations are specified by models different
from (4)), the simulation also contributes to the knowledge of this very simple and widely
used imputation method.
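
The three ratio-type estimators of this section can be written side by side. The sketch below is illustrative only: the function name `ratio_estimators` and its arguments are assumptions, `q_hat` stands for IRP estimates from method (1) or (2), and TC can be computed only in a simulation, where the y-values of nonrespondents are known.

```python
import numpy as np

def ratio_estimators(x, y, responded, q_hat, xbar_pop):
    """Return (TC, TI, TD) for one sample; xbar_pop is the known population mean of X."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    q_hat = np.asarray(q_hat, dtype=float)
    r = np.asarray(responded, dtype=bool)
    tc = xbar_pop * y.sum() / x.sum()          # full sample (complete response)
    ti = xbar_pop * y[r].sum() / x[r].sum()    # respondents only, no q correction
    # respondents reweighted by the inverse of the estimated IRP
    td = xbar_pop * (y[r] / q_hat[r]).sum() / (x[r] / q_hat[r]).sum()
    return tc, ti, td
```

If all the estimated probabilities are equal, the weights cancel and TD coincides with TI, which is the situation described above when h covers the whole range of the x-values.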


4. THE MONTE CARLO EXPERIMENT 


In the Monte Carlo experiment two populations, POP1 and POP2, were generated following
the same procedure as that of Sarndal and Hui (1981). POP1 and POP2 are both composed
of two strata, say S1 and S2, of 500 units each and satisfy the following equations:


$$E_\xi(Y_k) = \beta_1 x_{1k} + \beta_2 x_{2k},$$
$$\operatorname{Var}_\xi(Y_k) = \sigma_1^2 x_{1k} + \sigma_2^2 x_{2k}, \tag{5}$$


where $x_{1k} = x_k \delta_k$ and $x_{2k} = x_k (1 - \delta_k)$, with $\delta_k = 1$ if $k \in S1$ and $\delta_k = 0$ if $k \in S2$. The
difference between (4) and (5) simulates one of the many errors which one can incur in specifying
the superpopulation model. The numerical characteristics of POP1 and POP2 are shown
in Table 2.

The simulation procedure can briefly be described in the following steps:


1) A simple random sample s of n (n = 50, 100) units is selected from each population.


Table 2


Characteristics of Simulated Populations

Population                     POP1                              POP2
and strata         Mean      SD     CV     SK        Mean      SD     CV     SK

Stratum 1    x    19.305   12.71    .66   1.30      20.037   14.50    [?]    [?]
             y     [?]      [?]     [?]   1.62       1.961    [?]    1.13   3.[?]
Stratum 2    x    50.325    [?]     .42    [?]      49.775   23.28    .47   1.[?]
             y    30.325   13.38    .44    [?]      44.862   21.31    .47   1.[?]
Total        x    34.815   23.42    .67    .90      34.906   24.44    .70    [?]
             y    18.969   15.26    .80   1.06      23.411   26.25    [?]    [?]

SD = population standard deviation; SK = skewness (3rd moment / (2nd moment)^(3/2)); CV = coefficient of
variation. [?] = entry or digit illegible in the source.
2) The full sample values are recorded and nonresponse is then generated by each of the
two following parametric models:


Model A: $q_k = \exp(-\theta x_k)$,
Model B: $q_k = \theta_1^{\delta_k} \theta_2^{1 - \delta_k}$; $\delta_k = 1\ (0)$ if $k \in S1\ (S2)$,


where the parameters $\theta$, $\theta_1$, $\theta_2$ are chosen in such a way that the average response rate $\bar{q}$
over the whole population is alternatively 0.6 and 0.7. In practice, sets of respondents are
obtained by performing a Bernoulli trial for each unit $k \in s$, with probability $q_k$ for "success"
(response) and $1 - q_k$ for "failure" (nonresponse).

3) The IRP is estimated by methods (1) and (2) and, for each sample, the values of TC,
TI, TD, TD* are calculated.

4) Steps 1 to 3 are repeated 1000 times and at the end we calculate: bias, variance (VAR)
and mean squared error (MSE) of the estimators for each sample size (50, 100), response
model (A, B), average response rate (0.6, 0.7) and population (POP1, POP2).
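
The four steps can be sketched as a small simulation. Everything specific in the sketch is an assumption made for illustration: the population is a hypothetical two-stratum population, not POP1 or POP2 (the generating procedure of Sarndal and Hui is not reproduced here); only response model A and method (1) are used; and the names `window_irp`, `solve_theta` and `simulate` are invented.

```python
import numpy as np

def window_irp(x, resp, h):
    # method (1): response rate among units within distance h of each unit
    D = np.abs(x[:, None] - x[None, :]) <= h
    return (D & resp[None, :]).sum(axis=1) / D.sum(axis=1)

def solve_theta(x, target, lo=0.0, hi=1.0, iters=60):
    # bisection on theta: the mean of exp(-theta * x) decreases as theta grows
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.exp(-mid * x).mean() > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def simulate(n=50, reps=200, rate=0.7, h_frac=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # hypothetical population with two strata of 500 units and different slopes
    x = np.concatenate([rng.gamma(2.0, 10.0, 500), rng.gamma(4.0, 12.0, 500)])
    beta = np.where(np.arange(1000) < 500, 0.1, 0.8)
    y = beta * x + rng.normal(0.0, 1.0, 1000) * np.sqrt(x)
    q = np.exp(-solve_theta(x, rate) * x)        # response model A
    ybar, xbar = y.mean(), x.mean()
    est = {"TC": [], "TI": [], "TD": []}
    for _ in range(reps):
        s = rng.choice(1000, size=n, replace=False)   # step 1: simple random sample
        xs, ys = x[s], y[s]
        resp = rng.random(n) < q[s]                   # step 2: Bernoulli response
        if resp.sum() == 0:
            continue
        qh = window_irp(xs, resp, h_frac * (xs.max() - xs.min()))  # step 3
        est["TC"].append(xbar * ys.sum() / xs.sum())
        est["TI"].append(xbar * ys[resp].sum() / xs[resp].sum())
        est["TD"].append(xbar * (ys[resp] / qh[resp]).sum() / (xs[resp] / qh[resp]).sum())
    out = {}
    for name, vals in est.items():                    # step 4: summary measures
        v = np.asarray(vals)
        bias = v.mean() - ybar
        out[name] = {"bias": bias, "var": v.var(), "mse": bias ** 2 + v.var()}
    return out
```

Calling `simulate(n=100, rate=0.6)` returns the bias, variance and MSE of TC, TI and TD for one configuration. The numbers depend on the hypothetical population and are not comparable with Tables 3 and 4.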


The experimental results are reported in Tables 3 and 4. 


5. RESULTS OF THE MONTE CARLO EXPERIMENT 


Some interesting elements emerge from the examination of Tables 3 and 4. 


1. As expected, TC is approximately unbiased in all of the experimental trials.

2. In this experiment the bias of TI is always larger than that of TD and TD*. Therefore,
at least in the situations of the experiment, the adjusted estimator is to be preferred over
the non-adjusted one, which corresponds to a procedure of imputation by regression.

3. For the same h value, the bias of TD is always smaller than that of TD*. The differences
are negligible for h=.1. As h increases, TD* tends toward TI faster than TD: for
h=.5 the differences between TD* and TI are irrelevant for practical purposes.

4. The reduction of the bias we are able to obtain using TD instead of TI is always significant,
varying from 55% to 82% for model A, from 67% to 92% for model B. TD* also
experiences a notable reduction of the bias: from 51% to 68% for model A, from 61% to
84% for model B.

5. TD and TD* are equivalent in terms of MSE for h=.1, even though TD* is slightly
more stable (i.e. has a lower variance). For h=.3 and h=.5, the lesser stability of TD in
comparison with TD* is generally compensated by the smaller bias, more than enough to
make TD preferable to TD* in terms of MSE.

6. The estimators adjusted by the estimated IRP are not very stable but, in terms of MSE,
must be preferred to TI.

7. As expected, the bias is directly related to the increase of the nonresponse rate and
to the divergence between the true superpopulation model and the one assumed (i.e. the false
model on which the estimators are based). No relevant differences are revealed due to the
response models considered in this paper (see Giommi (1984) for the effect of alternative
models).

8. The increase of the sample size seems to reduce the bias slightly for all the estimators
considered. $TD_1$ and $TD_1^*$ are exceptions: in this case, the reduction of the bias cannot be
attributed to experimental fluctuations but to the actual improvement of the estimate $\hat q_k$
when n increases.
In the end, we may conclude that, in situations similar to the ones considered in this paper,
the two methods suggested can be used, with a certain preference for method (1) given its
simpler application. The problem of determination of the best value for h (or $h_k$, in the
general case) remains to be examined. We found that, within certain limits, small values for
h reduce the bias but also reduce the stability of the adjusted estimator. We have found that,
for our experimental examination, the optimum value of h is in the neighbourhood of 0.1.
Results obtained from the same experiment but not reported in this paper indicate that a
further reduction of h tends to increase the bias. This is to be expected since making h get
closer to 0 results in a collection of estimates $\hat q_k$ (k = 1, ..., n), equal to 1 and 0 respectively
for the respondents and nonrespondents.
Table 3


Performance of Different Estimators under Response Model A


Estimators            TC       TI     TD_1     TD_3     TD_5    TD*_1    TD*_3    TD*_5

Average response rate $\bar{q}$ = .60

POP1
n=50     BIAS       .015     .861     .349     .420     .669     .380     .620     .765
         VAR        .405     .973     [?]     1.036    1.007    1.041     .995     .989
         MSE        .405    1.714    1.237     [?]     1.455    1.185    1.379    1.574
n=100    BIAS       .007     .805     .164     [?]      .610     [?]      .544     .686
         VAR        .186     .416     .443     .429     .412     .415     .404     .402
         MSE        .186    1.064     .470     .533     .784     .467     .700     .873

POP2
n=50     BIAS       .090    3.125    1.433    1.682    2.544    1.544    2.378    2.887
         VAR       3.952    8.744    9.821    9.823    9.743    9.390    9.233    9.118
         MSE       3.960   18.510   11.874     [?]    16.215     [?]    14.888   17.453
n=100    BIAS       .056    2.959     .749    1.387     [?]     1.004    2.104    2.566
         VAR       1.710    4.144    4.515    5.122    4.819    4.238    4.632    4.518
         MSE       1.713   12.900    5.076    7.046   10.281    5.246    9.059   11.102

Average response rate $\bar{q}$ = .70

POP1
n=50     BIAS       .015     .581     .226     [?]      .418     .249     [?]      .436
         VAR        .405     .765     .794     .750     .738     .754     [?]      [?]
         MSE        .405    1.103     .845     .823     .913     .816     .924     .94[?]
n=100    BIAS       .007     [?]      .099     .205     .396     .143     [?]      [?]
         VAR        .186     .328     [?]      .307     .324     .238     [?]      .33[?]
         MSE        .186     .610     [?]      .349     .484     [?]      .454     .54[?]

POP2
n=50     BIAS       .090    2.130     .813     .939    1.542     .887    1.453    1.823
         VAR       3.952    6.996     [?]     6.827    6.991    6.708    6.753    6.87[?]
         MSE       3.960     [?]     7.783    7.709    9.396    7.495    8.864   10.194
n=100    BIAS       .056    1.966     .473     .953    1.541     .658    1.406    1.73[?]
         VAR       1.710    3.071    3.005    3.062    3.027    2.926    3.008    3.04[?]
         MSE       1.713    6.937    3.229    3.970    5.402     [?]     4.985    6.04[?]

[?] = entry or digit illegible in the source.
Table 4


Performance of Different Estimators under Response Model B


Estimators            TC       TI     TD_1     TD_3     TD_5    TD*_1    TD*_3    TD*_5

Average response rate $\bar{q}$ = .60

POP1
n=50     BIAS       .015    1.086     .290     .383     .716     .323     .688     [?]
         VAR        .405     .966    1.208     [?]      .937    1.050     .907     .928
         MSE        .405    2.145    1.29[?]  1.158    1.450    1.154    1.380    1.912
n=100    BIAS       .007    1.079     [?]      .349     .532     .196     .668     .902
         VAR        .186     .422     .513     .429     .420     .447     .401     .403
         MSE        .186    1.586     [?]      [?]      .956     .485     .847     [?]

POP2
n=50     BIAS       .090    4.046    1.362     [?]     2.826    1.562    2.749     [?]
         VAR       3.952   10.285     [?]      [?]    12.010   11.605   11.046   10.994
         MSE       3.960   26.655   14.374   15.176   19.996   14.045   18.603   23.682
n=100    BIAS       .056    3.897     .454     [?]     2.707     .853     [?]     3.284
         VAR       1.710    4.151    5.432     [?]     5.103    4.798    4.541    4.381
         MSE       1.713   19.338    5.638    7.465   12.431    5.52[?]   [?]    15.166

Average response rate $\bar{q}$ = .70

POP1
n=50     BIAS       .015     .584     .179     [?]      .409     .196     .376     .499
         VAR        .405     [?]      .826     .425     .716     .769     [?]      .743
         MSE        .405    1.092     .858     .474     .883     .807     .864     .992
n=100    BIAS       .007     .536     .046     [?]      .365     .087     [?]      .436
         VAR        .186     .307     .318     .295     .295     .299     .295     .302
         MSE        .186     .594     .320     [?]      .428     .307     .395     .492

POP2
n=50     BIAS       .090    2.057     .682     .891    1.477     .804    1.392    1.822
         VAR       3.952    6.199    6.788    6.165    6.232    6.340    6.093    6.270
         MSE       3.960   10.430    7.253    6.959    8.414    6.986    8.031    9.590
n=100    BIAS       .056    1.918     [?]      [?]      [?]      .374     [?]     1.562
         VAR       1.710    2.826    2.897    2.884    2.867    2.796    2.836    2.923
         MSE       1.713    6.506    2.922    3.454    4.586    2.936    4.217    5.363

[?] = entry or digit illegible in the source.
Estimates Based on Randomly
Rounded Data


C.S. WITHERS¹


ABSTRACT 


Methods are given to estimate functions of the cell probabilities associated with a table of multinomial
data that has been randomly rounded to multiples of a given number, say l. We show that: (i) random
rounding causes only second order effects on bias and variance; (ii) the loss of efficiency in using the
natural estimates of cell probability is negligible provided that the cell entry is large compared with
$(l^2 - 1)/(6R)$ where R is the number of cells in the table; and (iii) estimates of apparently exponentially
small bias are available for moments of these natural estimates and for polynomials in the cell
probabilities.


KEY WORDS: Random rounding; Bias reduction; Efficiency.


1. INTRODUCTION AND SUMMARY


This paper gives methods of estimating a function of the cell probabilities associated with
a table of multinomial data that has been randomly rounded. Random rounding is a widely
used method for preserving confidentiality in situations where an entry of 1 in a table might
identify an individual and so break a confidentiality requirement. Instead of tabling the value
of a table entry, say N, one rounds N to the nearest multiple of a given number l above
N with probability (w.p.) $\alpha$ or below N w.p. $1 - \alpha$, where $\alpha$ is chosen so that the rounded
value M satisfies


$$E(M \mid N) = N. \tag{1.1}$$


That is, if for some integer j, $jl \le N < (j + 1)l$, then


$$M = \begin{cases} (j + 1)l & \text{w.p. } \alpha, \\ jl & \text{w.p. } 1 - \alpha, \end{cases}$$


where $\alpha = r/l$ and $r = N - jl$.

The rounding base l used by the Department of Statistics in New Zealand is l = 3, while
Statistics Canada reportedly uses l = 5. See Penny and Ryan (1986).
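
The rounding rule can be sketched as follows. The function name `random_round` and its arguments are hypothetical, and NumPy's random generator is used only as a convenient source of uniform random numbers.

```python
import numpy as np

def random_round(counts, base, rng):
    """Round each count up to the next multiple of `base` w.p. r/base, else down."""
    counts = np.asarray(counts, dtype=int)
    r = counts % base                 # r = N - j*l
    lower = counts - r                # j*l, the multiple of the base just below N
    up = rng.random(counts.shape) < r / base   # round up with probability alpha = r/l
    return lower + base * up
```

A count that is already a multiple of the base has r = 0 and is left unchanged. For any other count the conditional mean is $jl + l(r/l) = N$, which is the unbiasedness requirement (1.1).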

Random rounding should not be confused with grouping or non-random rounding of sample
values to the nearest integral multiple of l (associated with Sheppard's corrections for
moments). Nor should it be confused with intentional contamination, another method of
preserving confidentiality where one simply adds to N an independent random variable with
mean 0. (The main disadvantage of intentional contamination is the possibility of a negative
cell entry). For some references on these methods see Gastwirth et al. (1978) and Kendall
and Stuart (1977). Some references on random rounding for multivariate data and grouped
data are also given in Gastwirth et al. (1978).


¹ C.S. Withers, Applied Mathematics Division, Department of Scientific and Industrial Research, Box 1335, Wellington,
New Zealand.
In this paper we confine our attention to problems of estimating a function of the
cell probabilities associated with a table of R values that have been randomly rounded.
For convenience we label these cell probabilities as $p_1, \ldots, p_R$ rather than
$\{p_{ij}\}$, as is more usual for an I × J table.

Thus, $\sum_{i=1}^{R} p_i = 1$ and $n = \sum_{i=1}^{R} N_i$ is the sum of the entries in the table. Let $\{M_i\}$ be the
rounded values of $\{N_i\}$. Given n, we assume $\{N_i\}$ has the multinomial distribution with
parameters n and $\{p_i\}$. This is true with $p_i = m_i / \sum_j m_j$ if, unconditionally, $\{N_i\}$ are independent
Poisson variables with means $\{m_i\}$.

Two unbiased estimates of $p_i$ are


$$p_i^* = N_i / n \quad \text{and} \quad \hat p_i = M_i / n. \tag{1.2}$$


The first is not a true estimate since $N_i$ is not made available. The second is the natural
estimate. (We assume n is reported. If it is not, there is negligible difference in replacing n by
$\sum_{i=1}^{R} M_i$.) However, other unbiased estimates exist, namely the "complementary estimate"


$$\tilde p_1 = 1 - \sum_{j \ne 1} M_j / n, \tag{1.3}$$

and hence

$$\hat p_1(\lambda) = (1 - \lambda)\hat p_1 + \lambda \tilde p_1 \quad \text{for any given } \lambda. \tag{1.4}$$


This raises the issue of what is the best $\lambda$ to use, and what loss of efficiency there is in sticking
to the natural estimate, that is, using $\lambda = 0$. An answer requires the variances of these
estimators. These are given by

Theorem 1.1. 


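$$\operatorname{var}(\hat p_1) = (p_1 - p_1^2)n^{-1} + \{(l^2 - 1)/6 + \Delta_n(p_1)\}n^{-2} = v_n(p_1), \tag{1.5}$$


where

$$\Delta_n(p_1) = \sum_{i=0}^{l-1} i(l - i)\{P(N_1 \bmod l = i) - l^{-1}\}. \tag{1.6}$$

Also,

$$\operatorname{var}(\tilde p_1) = (p_1 - p_1^2)n^{-1} + \Big\{(R - 1)(l^2 - 1)/6 + \sum_{j \ne 1} \Delta_n(p_j)\Big\}n^{-2}, \tag{1.7}$$

and

$$\operatorname{var}(\hat p_1(\lambda)) = (p_1 - p_1^2)n^{-1} + \{a(\lambda)(l^2 - 1)/6 + V_n(p)\}n^{-2}, \tag{1.8}$$

where

$$a(\lambda) = (1 - \lambda)^2 + (R - 1)\lambda^2 \tag{1.9}$$

and

$$V_n(p) = (1 - \lambda)^2 \Delta_n(p_1) + \lambda^2 \sum_{i \ne 1} \Delta_n(p_i). \tag{1.10}$$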


Proofs of the theorems in this paper are given in Section 2. 
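
The variance formula for the natural estimate can be checked numerically for a single cell, using the fact that the count in one cell of a multinomial table is binomial. The sketch below is an illustration, not part of the paper; the helper names are hypothetical. It compares a direct enumeration of the variance of $M_1/n$ with the closed form, using $E(M^2 \mid N) = N^2 + r(l - r)$ where $r = N \bmod l$.

```python
from math import comb

def binom_pmf(n, p):
    # P(N_1 = k) for k = 0..n, with N_1 ~ Binomial(n, p)
    return [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]

def delta_n(n, p, l):
    # sum over i of i(l - i) * {P(N_1 mod l = i) - 1/l}
    pmf = binom_pmf(n, p)
    prob_mod = [sum(pmf[k] for k in range(i, n + 1, l)) for i in range(l)]
    return sum(i * (l - i) * (prob_mod[i] - 1.0 / l) for i in range(l))

def var_natural_exact(n, p, l):
    # var(M_1 / n) by enumeration; E(M | N) = N and E(M^2 | N) = N^2 + r(l - r)
    pmf = binom_pmf(n, p)
    second = sum(w * (k * k + (k % l) * (l - k % l)) for k, w in enumerate(pmf))
    return (second - (n * p) ** 2) / n ** 2

def var_natural_formula(n, p, l):
    # closed form: (p - p^2)/n + {(l^2 - 1)/6 + Delta_n(p)}/n^2
    return (p - p * p) / n + ((l * l - 1) / 6.0 + delta_n(n, p, l)) / n ** 2
```

For moderate n the two functions agree to rounding error, and `delta_n` is already tiny, which illustrates why the term $(l^2 - 1)/6$ dominates the second order part of the variance.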
In Appendix A we give evidence that for $0 < p_1 < 1$, $P(N_1 \bmod l = i) - l^{-1} \to 0$
exponentially fast as $n \to \infty$, so that $\Delta_n(p_1) \to 0$ exponentially fast as $n \to \infty$, and hence
$V_n(p)$ also, provided $p_i \ne 0$ for all i.

Since $a(\lambda)$ is minimised by $\lambda_0 = R^{-1}$, with $a(\lambda_0) = 1 - R^{-1}$, so, asymptotically, is
$\operatorname{var}(\hat p_1(\lambda))$. Hence the loss of efficiency in using the natural estimate $\hat p_1$ rather than the
asymptotically optimal unbiased estimate $\hat p_1(\lambda_0)$ when R is large, is


$$\{\operatorname{var}(\hat p_1) - \operatorname{var}(\hat p_1(\lambda_0))\} / \operatorname{var}(\hat p_1(\lambda_0)) \approx (l^2 - 1)/\{6Rn(p_1 - p_1^2)\}, \tag{1.11}$$


which is negligible provided $M_1(1 - M_1/n) \approx n(p_1 - p_1^2)$ is large compared with
$(l^2 - 1)/\{6R\}$.

Generally $M_1(1 - M_1/n)$ can be approximated by $M_1$. This then gives a convenient rule
of thumb as to when the natural estimates are efficient. (If one or more $\{p_i\}$ are zero, since
$p_i = 0$ implies $N_i = M_i = 0$, $\sum_{i \ne 1}$ must be interpreted as excluding cells for which $p_i = 0$,
and R as the number of cells in the table for which $p_i \ne 0$.)
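
The choice of the mixing weight and the size of the efficiency loss are easy to evaluate numerically. The sketch below is illustrative; the function names are hypothetical, and `efficiency_loss` is the leading-order expression $(l^2 - 1)/\{6Rn(p_1 - p_1^2)\}$, which ignores the exponentially small terms.

```python
def a(lam, R):
    # coefficient of (l^2 - 1)/6 in the variance of the mixed estimate
    return (1.0 - lam) ** 2 + (R - 1) * lam ** 2

def optimal_lambda(R):
    # minimiser of a(lam, R): set -2(1 - lam) + 2(R - 1)lam = 0
    return 1.0 / R

def efficiency_loss(n, p1, l, R):
    # leading-order relative loss from using the natural estimate (lam = 0)
    return (l * l - 1) / (6.0 * R * n * (p1 - p1 * p1))
```

For example, with n = 1000, $p_1$ = 0.05, l = 3 and R = 20 the expected cell entry is about 50 and the relative loss is 8/5700, roughly 0.14 per cent, so the natural estimate loses very little.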

Using (1.5) we can now make a brief comparison with the method of contamination. The
Australian and U.K. statistics departments reportedly round by adding to each cell entry
1 w.p. 1/4, 0 w.p. 1/2 and −1 w.p. 1/4, so that


$$\operatorname{var}(\hat p_1) = (p_1 - p_1^2)n^{-1} + \tfrac{1}{2} n^{-2}.$$


The factor 1/2 improves on 4/3 for the New Zealand system (l = 3) and 4 for the Canadian
system (l = 5). The cost is less protection (a maximum change of 1 as opposed to 2
for the New Zealand system and 4 for the Canadian system), and a possibly negative cell
entry if the procedure is applied to cells with zero entries.

Theorem 1.1 shows that random rounding has only a second order effect on the efficiency
of estimating $p_1$: the variance is only increased by a term of magnitude $n^{-2}$. The next
result shows that this very important result is also true for estimating any smooth function
of p. Let $r = R - 1$, $p = (p_1, \ldots, p_r)'$, $N = (N_1, \ldots, N_r)'$, $M = (M_1, \ldots, M_r)'$,
$p^* = N/n$ and $\hat p = M/n$. Thus we have $\operatorname{cov}(p^*) = V/n$ where $V = \operatorname{diag}(p) - pp'$.
Suppose now we wish to estimate f(p), a function with continuous second derivatives.

That is, $\dot f(p) = \partial f(p)/\partial p$ is a continuous r × 1 function and $\ddot f(p) = \partial^2 f(p)/\partial p\,\partial p'$
is a continuous r × r function.


Theorem 1.2. As $n \to \infty$ both $E(f(p^*))$ and $E(f(\hat p))$ equal

$$f(p) + B(p)n^{-1} + O(n^{-2}) \quad \text{where } B(p) = \operatorname{trace}(\ddot f(p)V/2). \tag{1.12}$$

Also both $\operatorname{var}(f(p^*))$ and $\operatorname{var}(f(\hat p))$ equal

$$v(p)n^{-1} + O(n^{-2}) \quad \text{where } v(p) = \dot f(p)' V \dot f(p). \tag{1.13}$$


This theorem shows that
a) random-rounding increases the variance of the natural estimate for f(p) by only
$O(n^{-2})$; and
b) random-rounding likewise has only a second order effect on the bias of the natural estimate
for f(p).
According to (1.12), the natural estimate of f(p), $f(\hat p)$, has bias of magnitude $n^{-1}$. We
now show how to reduce this to $n^{-2}$.
Corollary 1.1. If for some function $f_n(p)$, $E(f_n(p^*)) = f(p) + O(n^{-2})$ then
$E(f_n(\hat p)) = f(p) + O(n^{-2})$.


Two such choices for $f_n(p)$ are the "delta-estimate" for which


$$f_n(p) = f(p) - \Big\{\sum_{i=1}^{r} f_{ii}(p)p_i - p' \ddot f(p) p\Big\} \Big/ (2n), \tag{1.14}$$


where $f_{ii}(p) = \partial^2 f(p)/\partial p_i^2$, and the "jack-knife estimate" for which

$$f_n(p) = nf(p) - (n - 1)\bar f, \tag{1.15}$$


where

$$\bar f = \sum_{i=1}^{r} p_i f([(np - e_i)/(n - 1)]) + \Big(1 - \sum_{i=1}^{r} p_i\Big) f([np/(n - 1)]),$$


$e_i$ = the i-th unit vector in $R^r$,


0, em 0 
and [<]| oR — Ris detineds by. [x] — 54-0 Osa ae 
Ne Xj ie! | 


These estimates were derived in Withers (1987a and 1987b). In particular, if $f(p)$ is only a function of $p_1$, say $f(p) = g(p_1)$, then $f_n(p) = g(p_1) - g_{\cdot\cdot}(p_1)\,p_1(1 - p_1)/(2n)$ and $\bar f = p_1\, g([(np_1 - 1)/(n - 1)]) + (1 - p_1)\, g([np_1/(n - 1)])$. For example, if $f(p) = p_1^2$ then the delta-estimate uses $f_n(p) = p_1\{p_1 - (1 - p_1)/n\}$.
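
The bias reduction in the example $f(p) = p_1^2$ can be verified exactly for the unrounded proportion $p_1^* = N_1/n$ with $N_1$ binomial. The following sketch is illustrative only (the function names are not from the paper, and rounding is not applied); it compares the bias of the natural estimate with that of the delta-estimate.

```python
from math import comb


def expectation(g, n, p):
    """Exact E[g(N/n)] for N ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) * g(k / n) for k in range(n + 1))


def natural(x):
    """Natural estimate of p^2."""
    return x * x


def delta_estimate(n):
    """Delta-estimate of p^2: x{x - (1 - x)/n}."""
    return lambda x: x * (x - (1 - x) / n)


p = 0.3
for n in (10, 100):
    bias_natural = expectation(natural, n, p) - p**2          # equals p(1-p)/n
    bias_delta = expectation(delta_estimate(n), n, p) - p**2  # equals p(1-p)/n^2
    print(n, bias_natural, bias_delta)
```

The natural estimate has bias $p_1(1 - p_1)/n$, while the delta-estimate has bias $p_1(1 - p_1)/n^2$, in line with Corollary 1.1.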

We now illustrate that if $f(p)$ is a polynomial we can in fact find an estimate of $f(p)$ based on the natural estimate with bias apparently exponentially small. We do this for the case $f(p) = p_1^2$.


Theorem 1.3. $\hat\lambda_1 = \{\hat p_1^2 - n^{-1}\hat p_1 - n^{-2}(l^2 - 1)/6\}(1 - n^{-1})^{-1}$ estimates $\lambda_1 = p_1^2$ with bias $\Delta_n(p_1)(n^2 - n)^{-1}$.


Similarly if $f_n(p)$ is a moment of $\hat p$ then we can also find an estimate of $f_n(p)$ with bias apparently exponentially small. We illustrate this for the case $f_n(p) = \mathrm{var}(\hat p_1)$.


Theorem 1.4. ., = n~'(p; — \;) — n~2( — 1)/6 estimates \,, = var(p,) with 
bias — A,(p,)(n? — n)7. 


These results may be generalised to higher order polynomials and moments using the expression for moments and cumulants of $\hat p$ given in Appendix B. We now show that for the special case of $f(p)$ collinear, an unbiased estimate exists.

Theorem 1.5. Set $f_I(p) = \prod_{i=1}^{I} p_i$ where $1 \le I \le R$ and
$$a_{n,I} = n(n - 1)\cdots(n - I + 1)\,n^{-I} = \prod_{k=0}^{I-1}(1 - k/n).$$
Then
$$E(f_I(\hat p)) = E(f_I(p^*)) = f_I(p)\,a_{n,I}. \tag{1.17}$$
Hence an unbiased estimate of $f_I(p)$ is $f_I(\hat p)/a_{n,I}$.

Corollary 1.2. $\mathrm{cov}(\hat p_1, \hat p_2) = -p_1p_2/n$. Its unbiased estimate is $-\hat p_1\hat p_2/(n - 1)$. More generally for $1 \le I \le R$, $E\big(\prod_{i=1}^{I}(\hat p_i - p_i)\big) = c_{n,I}\prod_{i=1}^{I} p_i$ with unbiased estimate $\big(\prod_{i=1}^{I}\hat p_i\big)c_{n,I}/a_{n,I}$, where $c_{n,I} = \sum_{j=0}^{I}(-1)^{I-j}\binom{I}{j}a_{n,j}$. (The same result holds with $\hat p$ replaced by $p^*$.)
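
As a check on the unbiasedness claim for products of distinct cell proportions, the sketch below computes $E(p_1^* p_2^*)$ exactly for a three-cell multinomial and divides by $a_{n,2}$. It covers the unrounded proportions only, the cell probabilities are arbitrary illustrative values, and the function names are hypothetical.

```python
from fractions import Fraction
from math import factorial


def falling_ratio(n, order):
    """a_{n,I} = n(n-1)...(n-I+1) / n^I."""
    value = Fraction(1)
    for k in range(order):
        value *= Fraction(n - k, n)
    return value


def expected_product(n, p1, p2):
    """Exact E[(N1/n)(N2/n)] for a trinomial with cell probabilities p1, p2, 1-p1-p2."""
    p3 = 1 - p1 - p2
    total = Fraction(0)
    for n1 in range(n + 1):
        for n2 in range(n - n1 + 1):
            n3 = n - n1 - n2
            coef = Fraction(factorial(n), factorial(n1) * factorial(n2) * factorial(n3))
            prob = coef * p1**n1 * p2**n2 * p3**n3
            total += prob * Fraction(n1, n) * Fraction(n2, n)
    return total


n, p1, p2 = 7, Fraction(1, 5), Fraction(3, 10)
# Dividing by a_{n,2} = 1 - 1/n recovers p1 * p2 exactly.
print(expected_product(n, p1, p2) / falling_ratio(n, 2), p1 * p2)
```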


From (1.16) one may derive unbiased estimates for other special polynomials in p such as
Dy, PiP2(P; + pr) and Uf, p} - but not for pip, or p}. 


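Corollary 1.3. For $1 \le I < R$ an unbiased estimate of $f_I(p)\sum_{i=1}^{I} p_i$ is
$$f_I(\hat p)\Big\{1 - I n^{-1} - \sum_{i=I+1}^{R}\hat p_i\Big\}\Big/a_{n,I+1}. \tag{1.18}$$
In particular an unbiased estimate of $p_1^2$ is
$$\hat p_1\Big\{1 - n^{-1} - \sum_{i=2}^{R}\hat p_i\Big\}\Big/(1 - n^{-1}). \tag{1.19}$$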


We emphasize that the results of this paper are based on the assumption that table entries 
are independent Poisson’s, or at least multinomial conditional on the total. The Poisson and 
multinomial models are appealing as they have a ready interpretation, and because sums 
of Poisson variables are Poisson. But sums of multinomials are multinomial only if they 
share the same cell probabilities p. This suggests that conclusions drawn from such models 
may be less accurate if the populations modelled are composed of two or more inhomogeneous groups.


2. PROOFS 


Proof of Theorem 1.1. Set $r = N_1 \bmod l$. Then (1.1) holds for $N = N_1$, $M = M_1$, that is,
$$P(M_1 = N_1 - r \mid N_1) = 1 - r/l, \qquad P(M_1 = N_1 - r + l \mid N_1) = r/l.$$


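Hence
$$E(\hat p_1^2) = E(p_1^{*2}) + n^{-2}A_n(p_1), \tag{2.1}$$
where
$$A_n(p_1) = E(M_1 - N_1)^2 = E(lr - r^2) = \sum_{i=0}^{l-1}(li - i^2)P(r = i) = (l^2 - 1)/6 + \Delta_n(p_1)$$
since
$$l^{-1}\sum_{i=0}^{l-1}(li - i^2) = (l^2 - 1)/6. \tag{2.2}$$
But
$$E(p_1^{*2}) = p_1^2 + (p_1 - p_1^2)n^{-1}, \tag{2.3}$$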
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so (1.5) follows. Now p,; = Pp —(M; aN, 3 


so E(p?) = E(p?) — 2n-? Y) E(M\(Mj — Nj)) + 27? SY) ECM; — Nj) (Mj — NV 
= E(pj) = 2n7A, (pi) + 2 An @) 
since E (II;.f;(M;)|{N3) = ILE (f(M;) (Nj). (2.4 


Hence var(p;) = (p; — p7)n | + n~*Z; 2; An(pj) $O (1.7) holds. 


Also, 
E(pi~1) = p, — 2? YY E(M\M) = p, — Y) E(eiPi) 
iA] iA] 
SHR AOR a Rien ONE MONS Tee 
iAl 
SO 
cov(p,,P\) = (pi — pi)n'. (2.5 


Hence var(p,(\)) = (p; — p?)n7! + {C1 — d)7A, (D1) +? Ligy An(p))}n 7? and (1.8 
holds. 


Proof of Theorem 1.2. This was proved for $p^*$ in Withers (1987a). Also since $f_{\cdot\cdot}$ is finite in a neighborhood of $p$,
$$f(\hat p) = f(p^*) + (\hat p - p^*)' f_{\cdot}(p^*) + O(|\hat p - p^*|^2),$$
$$E((\hat p - p^*) \mid N) = 0, \qquad E((\hat p_i - p_i^*)^2 \mid N) = 2n^{-2} I(N_i \bmod l \ne 0),$$
where $I(A) = 1$ or 0 for $A$ true or false, that is, $I(\cdot)$ is the indicator function. Hence $E(f(\hat p)) = E(f(p^*)) + O(n^{-2})$ and $\mathrm{var}(f(\hat p)) = \mathrm{var}(f(p^*)) + O(n^{-2})$.


Proof of Theorem 1.3. This follows directly from (2.1) and (2.3). 
Proof of Theorem 1.4. This follows from (2.1) and (1.5). 


Proof of Theorem 1.5. The first equality in (1.16) follows from (2.4), and the second from the multinomial theorem. Corollary 1.2 follows immediately.


Proof of Corollary 1.3. From (1.16), for $1 \le I < i \le R$ we have
$$E(f_I(\hat p)\,\hat p_i) = f_I(p)\,p_i\,a_{n,I+1}$$
so
$$E\Big(f_I(\hat p)\sum_{i=I+1}^{R}\hat p_i\big/a_{n,I+1}\Big) = f_I(p)\Big(1 - \sum_{i=1}^{I} p_i\Big) = E(f_I(\hat p)/a_{n,I}) - f_I(p)\sum_{i=1}^{I} p_i.$$


APPENDIX A


One expects that for $f$ a smooth function
$$E\big(f(p^*)\,I(N_1 \equiv j_1 \bmod l, \ldots, N_s \equiv j_s \bmod l)\big) \to f(p)\,l^{-s} \tag{A.1}$$
as $n \to \infty$ provided $0 < p_i < 1$ for $1 \le i \le s \le R$.

If $E(f(p^*)) = f(p)$, one expects the rate of convergence to be exponential, $O(e^{-\lambda n})$ for some $\lambda > 0$. If $f(p^*)$ is biased, then its bias is $O(n^{-1})$, so that one would expect this rate also to apply to (A.1). Convergence will in general break down as $p$ approaches the boundary of $[0,1]^s$, since
$$E\big(f(p^*)\,I(N_1 \equiv j_1 \bmod l, \ldots, N_s \equiv j_s \bmod l)\big) = \begin{cases} f(p)\,I(j_1 = \cdots = j_s = 0) & \text{if } p_1 = \cdots = p_s = 0, \\ f(p)\,I(j_1 \equiv n \bmod l) & \text{if } p_1 = 1. \end{cases}$$

To test these expectations we considered the case $s = 1$, $l = 3$, $j = 0$ and the functions (a) $f(p) = 1$, (b) $f(p) = p_1$, and (c) $f(p) = \exp(p_1)$. Computations were done in quadruple precision on a VAX11/780, giving a precision for
$$\Delta = E\big(f(p^*)\,I(N_1 \equiv j_1 \bmod l, \ldots, N_s \equiv j_s \bmod l)\big) - f(p)\,l^{-s}$$
of 112 bits, nearly 34 decimal places. Figures 1a, 1b and 1c plot $\Delta$ versus $p_1$ for $n = 6, 18, 54$. Since $n \bmod 3 = 0$, $\Delta$ is symmetric about $p_1 = 1/2$ for (a).
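
The quantity $\Delta$ for the case $s = 1$ can be recomputed directly from the binomial distribution of $N_1$. The sketch below is an illustration in ordinary double precision (not the quadruple precision used for the figures), taking the function argument to be the unrounded proportion $N_1/n$; the function name `delta` is hypothetical.

```python
from math import comb, exp


def delta(f, n, p1, base=3, residue=0):
    """E[f(N1/n) I(N1 = residue mod base)] - f(p1)/base for N1 ~ Binomial(n, p1)."""
    total = 0.0
    # Only counts k congruent to `residue` modulo `base` contribute to the expectation.
    for k in range(residue, n + 1, base):
        total += comb(n, k) * p1**k * (1.0 - p1) ** (n - k) * f(k / n)
    return total - f(p1) / base


def one(x):
    return 1.0


for n in (6, 18, 54):
    print(n, delta(one, n, 0.0), delta(one, n, 0.3), delta(one, n, 0.7), delta(exp, n, 0.5))
```

With $n$ a multiple of 3 the values at $p_1 = 0.3$ and $p_1 = 0.7$ coincide for $f = 1$, and at $p_1 = 0$ the function returns $(2/3)f(0)$.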

Since $\Delta = (2/3)f(0)$ at $p_1 = 0$, and is equal to 2/3, 0 and 2/3 for (a), (b) and (c) respectively, convergence breaks down at $p_1 = 0$ for (a) and (c), but not for (b). At $n = 18$, $\Delta$ is already negligibly different from 0 for $p_1$ in (.2, .8) for (a) and for $p_1$ in (.1, .8) for (b) and (c). At $n = 54$, these ranges have grown to cover (.1, .9) for (a), (.02, .95) for (b), and
m7, .95) for (c). 

Figures 2a and 2b plot $Y = \log(-\log|\Delta|)$ versus $X = \log(n)$ for (a) $f(p) = 1$ and (b) $f(p) = p_1$. As expected, except for small $n$, the curves are roughly parallel to $Y = X$ (except for (b) with $p_1 = .01$), consistent with $\Delta = O(e^{-\lambda n})$ for some $\lambda > 0$. The curves are not smooth, as $\Delta$ has only been calculated at $n$ a power of 2 ($n = 2^i$).

Figure 1a. Evidence for (A.1) When $f(p) = 1$. (Plot of $\Delta$ versus $p_1$.)

Figure 1b. Evidence for (A.1) When $f(p) = p_1$. (Plot of $\Delta$ versus $p_1$.)

Figure 1c. Evidence for (A.1) When $f(p) = \exp(p_1)$. (Plot of $\Delta$ versus $p_1$.)

Figure 2a. Evidence for Exponential Convergence in (A.1) for $f(p) = 1$. (Plot of $Y = \log(-\log|\Delta|)$ versus $\log n$.)

Figure 2b. Evidence for Exponential Convergence in (A.1) for $f(p) = p_1$. (Plot of $Y = \log(-\log|\Delta|)$ versus $\log n$.)

Figure 3. Evidence for Convergence at Rate $n^{-1}$ in (A.1) for $f(p) = \exp(p_1)$. (Plot of $Y = -\log|\Delta|$ versus $\log n$.)

Figure 3 plots $Y = -\log|\Delta|$ versus $X = \log(n)$ for (c) $f(p) = \exp(p_1)$. For $n$ large the curves are parallel to $Y = X$ for $p_1 = .5$ and .1, consistent with $\Delta = O(n^{-1})$, but for $p_1 = 0.1$ the increase is much faster than linear. The graphs generally confirm our expectations on the rate of convergence in (A.1). To obtain analytic proofs would appear to require some sophisticated number theory.


APPENDIX B 


Here we compare the moments and cumulants of $p^* = N/n$ and $\hat p = M/n$. Set $q_1 = 1 - p_1$, $n_i = N_i \bmod l$, and $m_k(j) = E(p_1^{*k} I(n_1 = j)) \to p_1^k/l$ as $n \to \infty$, assuming $p_1 \ne 0$ or 1. Elementary calculations yield
$$\mu_1(\hat p_1) = \mu_1(p_1^*) = p_1,$$
$$\mu_2(\hat p_1) = \mu_2(p_1^*) + M_{2n}n^{-2} = p_1q_1n^{-1} + O(n^{-2}),$$
where
$$M_{2n} = A_n(p_1) = \sum_{i=0}^{l-1} i(l - i)\,m_0(i) \to (l^2 - 1)/6$$
as $n \to \infty$,
pa 
v3(Pi) = w3(p1) + 3n~* (Ui — j?)ma(/) — 2pyms (i) + p?mo(/)} 

| j=0 
rae 
| ner > ajo (J) 

j=0 
| = 13(pi) + o(n~*) = pigs (1 — 2p,)n-? + 0(n7?), 
ind 
| an = —P CU -j/l) + = fy/i. 
milarly 4(P,) has the form p4(p;) + £3 Myn~' = O(n~*) and x4(p,) has the form 


42 Kan‘ where ky, = My does not converge to 0 as n — o. Hence k4(P}) ~ n~, not 
rm. Hence p does not satisfy the Cornish-Fisher assumption that x,(p) = On. \ tor 
'2 1: see for example Kendall and Stuart (1977). 

Moments and cumulants may also be obtained from the m.g.f. (moment generating function), which we now obtain.
$$E(\exp(t_1M_1/n) \mid N_1) = \exp(t_1N_1/n)\,S(t_1, n_1)$$
where
$$S(t_1, n_1) = (1 - n_1/l)\exp(-n_1t_1/n) + (n_1/l)\exp((l - n_1)t_1/n).$$
Hence by (2.4), the m.g.f. is
$$E(\exp(t'\hat p)) = E(\exp(t'N/n)\,S(t)) \quad\text{where}\quad S(t) = \prod_i S(t_i, n_i).$$
Also at $t = 0$, $S_i = 0$ and so $S_{i\ldots} = 0$ if a subscript occurs exactly once. For example, setting $\partial_i = \partial/\partial t_i$, $S_i = \partial_i S$, $S_{ij} = \partial_i\partial_j S$,
$$E(\hat p_1^2\exp(t'\hat p)) = E\big(\exp(t'N/n)\{p_1^{*2}S + 2p_1^*S_1 + S_{11}\}\big),$$
$$E(\hat p_1^2\hat p_2^2\exp(t'\hat p)) = E\big(\exp(t'N/n)\{p_1^{*2}(p_2^{*2}S + 2p_2^*S_2 + S_{22}) + 2p_1^*(p_2^{*2}S_1 + 2p_2^*S_{12} + S_{122}) + (p_2^{*2}S_{11} + 2p_2^*S_{112} + S_{1122})\}\big).$$
Hence $E(\hat p_1^2) = E\{p_1^{*2} + S_{11}(0)\}$ and
$$E(\hat p_1^2\hat p_2^2) = E\{p_1^{*2}p_2^{*2} + p_1^{*2}S_{22}(0) + p_2^{*2}S_{11}(0) + S_{1122}(0)\},$$
where $S_{ii}(0) = S_{ii}(0, n_i) = n^{-2}(ln_i - n_i^2) = n^{-2}\sum_{k=0}^{l-1}(l - k)k\,I(n_i = k)$ and $S_{1122}(0) = S_{11}(0)S_{22}(0)$. Some further simplifications can be obtained using $N_2 \mid N_1 \sim \mathrm{Bi}(\theta, n - N_1)$ where $\theta = p_2/(1 - p_1)$. From the multinomial m.g.f. one obtains
$$E(p_1^{*2}p_2^{*2}) = n^{-4}p_1p_2\{(n)_4\,p_1p_2 + (n)_3(p_1 + p_2) + (n)_2\}$$
where $(n)_i = n!/(n - i)! = n(n - 1)\cdots(n - i + 1)$.


Variance Estimation for the Canadian
Labour Force Survey


G.H. CHOUDHRY and H. LEE! 


ABSTRACT 


The biases and stabilities of alternative variance estimators for the two stage random group design 
(Rao et al. 1962) are evaluated in a Monte Carlo study in the context of Canadian Labour Force Survey. 
The variance formula for raking ratio estimation procedure is derived using Taylor linearization method. 
The properties of the variance formula are investigated by a Monte Carlo simulation. 


KEY WORDS: Keyfitz’s variance estimator; Raking ratio estimator; Taylor linearization; Monte Carlo 
simulation. 


1. INTRODUCTION 


The Canadian Labour Force Survey (LFS) is the largest monthly household survey con- 
ducted by Statistics Canada and is used to produce estimates of various labour force 
characteristics at national, provincial and sub-provincial levels. It follows a stratified multi- 
stage rotating sample design with six rotation panels (Platek and Singh 1976). 

Following each decennial census of population, the LFS has undergone a sample redesign. As part of the 1981 post-censal redesign, an extensive program of research was undertaken in the areas of sampling, data collection, and estimation methodologies (Singh and Drew 1981). The post-stratified ratio estimation procedure used in the old design was replaced by a raking ratio estimation procedure to improve the reliability of subprovincial data. This paper presents the results related to variance estimation methodology.

The methodology for variance estimation for the old LFS was based on Woodruff’s generalization (Woodruff 1971) of the Keyfitz procedure (Keyfitz 1957) using Taylor linearization applied to the post-stratified ratio estimates (Platek and Singh 1976). This method will be called the Keyfitz method as in Platek and Singh (1976).

There are three area types identified in the LFS design, i.e., self-representing (SR) areas consisting of major cities, non-self-representing (NSR) areas which are smaller urban and rural areas, and special areas composed of military establishments, institutions and remote areas. For the NSR and special areas it was decided to use the Keyfitz method with modification to incorporate the raking ratio estimation procedure.

However, for the two-stage random group design in SR areas, two alternative variance estimators given by Rao, Hartley, and Cochran (1962) and by Rao (1975) were evaluated and compared with Keyfitz’s method using Monte Carlo simulation. The alternative variance estimators of estimates with and without ratio adjustment were compared with respect to their biases and stabilities. The impact on the Keyfitz variance estimator due to increase of the number of replicates was also examined. Details are reported in Section 2. Based on the results of the evaluation, the Keyfitz method was adopted for SR areas as well.


G.H. Choudhry and H. Lee, Social Survey Methods Division, Statistics Canada, 4th Floor, Jean Talon Building, Tunney’s Pasture, Ottawa, Ontario, K1A 0T6.

The Keyfitz variance formula for raking ratio estimates used for all area types in the LFS is derived in Section 3 and evaluated by Monte Carlo study. Finally in Section 4, some concluding remarks are given.


2. VARIANCE ESTIMATION FOR THE SR DESIGN 


2.1 SR Design 


The LFS design in the SR areas is a two-stage random group design (Rao et al. 1962) with probability proportional to size (PPS) selection of primary sampling units (PSU’s) and systematic selection of dwellings at the second stage such that the design becomes self-weighting. Suppose that there are $N$ PSU’s in a given stratum and let $x_j$ and $M_j$, $j = 1, 2, \ldots, N$, respectively be the size measure and dwelling count for the j-th PSU in the stratum. Let $1/W$ be the sampling rate in the stratum, where $W$ is an integer, and $n$ be the number of PSU’s to be selected from the stratum. The $N$ PSU’s in the stratum are randomly partitioned into $n$ groups so that the i-th random group contains $N_i$ PSU’s, and $\sum_{i=1}^{n} N_i = N$.

Define
$$p_j = x_j\Big/\sum_{j=1}^{N} x_j$$
and
$$\delta_{ij} = \begin{cases} 1 & \text{if the j-th PSU is in the i-th group,} \\ 0 & \text{otherwise.} \end{cases}$$
Then $\pi_i = \sum_{j=1}^{N}\delta_{ij}p_j$ is the relative size of the i-th random group.

Now define $W_{ij}$, the sampling interval for systematic sampling, as follows: Let $a_{ij} = \delta_{ij}Wp_j/\pi_i$ and $r_{ij} = a_{ij} - [a_{ij}]$, where $[a]$ is the greatest integer less than or equal to $a$. Without loss of generality, we assume that the set $\{r_{ij}, j = 1, 2, \ldots, N\}$ is in descending order. Then, $W_{ij}$ is defined as
$$W_{ij} = \begin{cases} [a_{ij}] + 1 & \text{for } j = 1, 2, \ldots, R_i, \\ [a_{ij}] & \text{for } j = R_i + 1, \ldots, N, \end{cases}$$
where $R_i = \sum_{j=1}^{N} r_{ij}$. Then, by definition, $\sum_{j=1}^{N} W_{ij} = W$ for the i-th random group.

Since $W_{ij}$ is the sampling interval for systematic sampling from the selected cluster in the i-th random group, it is defined as an integer for operational simplicity.
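
The construction of integer sampling intervals within one random group can be sketched as follows. This is an illustrative reading of the definition above, not production code: the size measures are made-up values, the function name is hypothetical, and the fractional parts are ranked explicitly rather than assumed to be pre-sorted.

```python
from fractions import Fraction
from math import floor


def sampling_intervals(sizes, W):
    """Integer intervals W_ij for the PSUs of one random group, summing to W.

    sizes: size measures x_j of the PSUs in the group (integers).
    W: inverse sampling rate for the stratum (integer).
    """
    total = sum(sizes)
    # a_ij = W * p_j / pi_i reduces to W * x_j / (sum of x over the group).
    a = [Fraction(W * x, total) for x in sizes]
    base = [floor(v) for v in a]
    # The fractional parts sum to an integer: the number of intervals to round up.
    n_round_up = W - sum(base)
    order = sorted(range(len(sizes)), key=lambda j: a[j] - base[j], reverse=True)
    intervals = base[:]
    for j in order[:n_round_up]:
        intervals[j] += 1
    return intervals


print(sampling_intervals([30, 50, 20, 45], 25))
```

Selecting one PSU with probability $W_{ij}/W$ and sub-sampling it at rate $1/W_{ij}$ then gives every dwelling in the group the overall rate $1/W$. A PSU whose $a_{ij}$ is below 1 could receive an interval of 0 under this sketch; how such cases are handled is not stated in the text.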

One PSU is selected with probability proportional to the $W_{ij}$’s from each of the $n$ random groups independently. The selected PSU j from the i-th random group is sub-sampled systematically at the rate $1/W_{ij}$. Then the overall sampling rate in each of the $n$ random groups is $1/W$ so that the design becomes self-weighting with a design weight equal to $W$. Each random group is assigned a panel number from 1 to 6. The number of random groups $n$ is usually a multiple of six so that each panel has the same number of random groups.

Since only one PSU is selected from each random group, we denote by $1/W_i$ the sub-sampling rate in the selected PSU from the i-th random group and by $m_i$ the number of selected dwellings from the random group i.


2.2 Alternative Variance Estimators 


Suppose that we are interested in the total of a characteristic y for the stratum. Let $y_{jk}$ be the y-value for the k-th dwelling in the j-th PSU where $k = 1, 2, \ldots, M_j$. Then the total $Y = \sum_{j=1}^{N}\sum_{k=1}^{M_j} y_{jk}$ can be estimated by $\hat Y = W\sum_{i=1}^{n} y_i$, where $y_i$ is the sum of y-values for the $m_i$ sampled dwellings from the PSU selected from the i-th group, $i = 1, 2, \ldots, n$. Consider the following variance estimators for estimating the variance of the estimated total $\hat Y$:


(1) Keyfitz’s (1957) Variance Estimator

This estimator was used in the old design with two pseudo-replicates formed by collapsing the odd numbered panels into one replicate and the even into the other. Ignoring the finite population correction (fpc), the variance is obtained by
$$V_1(\hat Y) = W^2\Big(\sum\nolimits' y_i - \sum\nolimits'' y_i\Big)^2 \tag{2.1}$$
where $\sum'$ is the summation over all the odd numbered panels and $\sum''$ is the summation over all the even numbered panels. Alternatively, the generalized Keyfitz variance estimator for $n$ ($\ge 2$) replicates, which is given by
$$V_2(\hat Y) = W^2\,\frac{n}{n - 1}\sum_{i=1}^{n}(y_i - \bar y)^2 \tag{2.2}$$
where $\bar y = (1/n)\sum_{i=1}^{n} y_i$, can be used. In this case each PSU or panel is taken as a replicate. $V_2$ was considered because it was thought that this variance estimator might have better efficiency (stability) than $V_1$ due to its larger number of degrees of freedom.
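
The two Keyfitz estimators can be sketched in a few lines. The code assumes the forms $V_1 = W^2(\text{odd-panel sum} - \text{even-panel sum})^2$ and $V_2 = W^2\,\frac{n}{n-1}\sum_i (y_i - \bar y)^2$ as read from (2.1) and (2.2); the data values and function names are illustrative only.

```python
def keyfitz_two_replicates(y, panels, W):
    """V1: collapse odd and even numbered panels into two pseudo-replicates."""
    odd = sum(v for v, panel in zip(y, panels) if panel % 2 == 1)
    even = sum(v for v, panel in zip(y, panels) if panel % 2 == 0)
    return W**2 * (odd - even) ** 2


def keyfitz_n_replicates(y, W):
    """V2: treat each of the n sample totals y_i as its own replicate."""
    n = len(y)
    mean = sum(y) / n
    return W**2 * n / (n - 1) * sum((v - mean) ** 2 for v in y)


y = [3, 5, 4, 6, 2, 4]        # hypothetical sample totals for six random groups
panels = [1, 2, 3, 4, 5, 6]   # panel number assigned to each random group
print(keyfitz_two_replicates(y, panels, 25), keyfitz_n_replicates(y, 25))
```

With exactly two replicates the second function reduces to the first, which is a quick consistency check on the two formulas.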


(2) Rao, Hartley, and Cochran’s (1962) Variance Estimator

This variance formula is derived under the assumption that the number of secondaries $m_i$ to be selected from the i-th group is fixed for $i = 1, 2, \ldots, n$, and simple random sampling (SRS) is also assumed at the second stage. The variance estimator is given by:
$$V_3(\hat Y) = A\sum_{i=1}^{n}\pi_i\Big(\frac{M_i\bar y_i}{p_i} - \hat Y\Big)^2 + \sum_{i=1}^{n}\frac{\pi_i}{p_i}M_i^2\Big(\frac{1}{m_i} - \frac{1}{M_i}\Big)s_i^2 \tag{2.3}$$
where
$$A = \frac{\sum_{i=1}^{n} N_i^2 - N}{N^2 - \sum_{i=1}^{n} N_i^2}, \tag{2.4}$$
$$s_i^2 = \frac{1}{m_i - 1}\sum_{k=1}^{m_i}(y_{ik} - \bar y_i)^2, \tag{2.5}$$
$M_i$ is the number of dwellings in the selected PSU from the i-th group and $m_i$ out of $M_i$ dwellings are selected with systematic sampling but the variance estimate is obtained under the assumption of SRS. The y-value for the k-th selected dwelling from the selected PSU in the i-th group is $y_{ik}$ and $\bar y_i = y_i/m_i$.

Since $\pi_i/p_i = W/W_i$ and $M_i/m_i = W_i$ (these equalities are not strict due to the use of integer values for $W_i$), the variance formula (2.3) can be written as:
$$V_3(\hat Y) = A\sum_{i=1}^{n}\pi_i\Big(\frac{W y_i}{\pi_i} - \hat Y\Big)^2 + W\sum_{i=1}^{n}(W_i - 1)\,m_i s_i^2. \tag{2.6}$$

(3) Rao’s (1975) Variance Estimator 


In this case it is assumed that m; secondaries are selected with SRS but, since the design 
is self-weighting, the sample size m; at the second stage is treated as a random variable. The 
variance estimator is given by: 


: ; . M?s? i Tj 


where A is defined by (2.4) and s? by (2.5). After some simplification (2.7) can be written as) 


ae by ay a W, 1 
Vi(Y) = Vi(Y Ww eed Ue elie d A oh spe 2.8 
4(Y) = V3(¥) + y mstt ( ) ( \ (2.8 


We note that there is an additional term, which could be positive or negative, in the variance’ 
formula when random sample size is assumed at the second stage. 


2.3. Monte Carlo Study 


In order to evaluate the biases of the four variance estimators and their relative stabilities, a Monte Carlo study was carried out with 19 Labour Force strata from the Census Metropolitan Area (CMA) of Halifax using data from the 1981 census. The census data for the purpose of this study was the census sample given the long questionnaire, which is a 20% systematic sample of dwellings within Enumeration Areas. The sampling rate $1/W$ was taken to be 0.04 to obtain the same expected sample size as in the actual redesigned LFS. The number of random groups within each stratum was even and was determined so that the expected sample size within random groups would be as close to 4.5 as possible to correspond to the actual LFS. The 19 strata chosen for the study are shown in Table 1 with the number of PSU’s, the number of selected PSU’s, the number of dwellings, and the expected sample sizes along with the corresponding totals for all the strata. Within each of the 19 strata, 1,000 samples were generated independently using a Monte Carlo technique, employing the random group design described in Subsection 2.1.


Table 1
Strata Used for the Monte Carlo Study

Stratum    No. of        No. of    No. of            Expected
           Dwellings     PSU’s     Selected PSU’s    Sample Size

 1         [illegible]     49          6               29.5
 2            490          33          4               19.6
 3            745          45          6               29.8
 4            720          34          6               28.8
 5            621          37          6               24.8
 6            630          38          6               25.2
 7            503          31          4               20.1
 8            340          23          4               13.6
 9            472          33          4               18.9
10            468          33          4               18.7
11            367          28          4               14.7
12            390          23          4               15.6
13            626          36          6               25.0
14            650          39          6               26.0
15            350          22          4               14.0
16            736          46          6               29.4
17            573          35          6               22.9
18         [illegible]     48          6               30.9
19            866          64          8               34.6

Total      11,057         697        100              442.3


Let $\hat Y_{ht}$ be the estimate of the total $Y_h$ for stratum h from the t-th Monte Carlo draw, $h = 1, 2, \ldots, 19$, and $t = 1, 2, \ldots, 1{,}000$. Similarly $V_{jht}$, $j = 1, 2, 3, 4$ are the four variance estimators of $\hat Y_{ht}$.

Now define
$$\hat Y_t = \sum_{h=1}^{19}\hat Y_{ht}, \qquad V_{jt} = \sum_{h=1}^{19} V_{jht}, \qquad t = 1, 2, \ldots, 1{,}000.$$
$\hat Y_t$ is the estimate of the total $Y$ obtained from the t-th Monte Carlo draw and $V_{jt}$, $j = 1, 2, 3, 4$ are the corresponding variance estimates.

The Monte Carlo expectation and variance denoted by $E^*$ and $V^*$ respectively are defined for $T$ Monte Carlo draws as follows:
$$E^*(\hat\theta) = \frac{1}{T}\sum_{t=1}^{T}\hat\theta_t, \qquad V^*(\hat\theta) = \frac{1}{T}\sum_{t=1}^{T}\{\hat\theta_t - E^*(\hat\theta)\}^2,$$
where $\hat\theta$ is an estimator of the unknown parameter $\theta$ and $\hat\theta_t$ is the estimate obtained from the t-th draw. Using these definitions, we obtain the Monte Carlo variance of the estimator $\hat Y$, $V^*(\hat Y)$, and the Monte Carlo expectations and variances of the variance estimators, $E^*(V_j)$ and $V^*(V_j)$ respectively for $j = 1, 2, 3, 4$.
We define the bias of the variance estimator $V_j$ by:
$$B_j = E^*(V_j) - V^*(\hat Y),$$
and percent bias as:
$$PB_j = \frac{B_j}{V^*(\hat Y)} \times 100, \qquad j = 1, 2, 3, 4.$$
Then the Mean Square Error (MSE) of $V_j$ is given by:
$$MSE_j = V^*(V_j) + B_j^2, \qquad j = 1, 2, 3, 4.$$
We define the efficiency of $V_j$ relative to the Keyfitz variance estimator with two replicates ($V_1$) as:
$$\text{Rel. Eff}(V_j \text{ vs. } V_1) = (MSE_1/MSE_j)^{1/2}, \qquad j = 2, 3, 4.$$
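
The evaluation measures defined above can be computed from the Monte Carlo draws as in the following sketch. It assumes the divisor $T$ (rather than $T - 1$) in the Monte Carlo variance, and the function names and toy inputs are illustrative only.

```python
from math import sqrt


def mc_mean(values):
    """Monte Carlo expectation E*: the average over the T draws."""
    return sum(values) / len(values)


def mc_variance(values):
    """Monte Carlo variance V*, with divisor T (an assumption of this sketch)."""
    mean = mc_mean(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def percent_bias(variance_estimates, totals):
    """100 * {E*(V_j) - V*(Y-hat)} / V*(Y-hat)."""
    target = mc_variance(totals)
    return 100.0 * (mc_mean(variance_estimates) - target) / target


def mse(variance_estimates, totals):
    """V*(V_j) + B_j^2."""
    bias = mc_mean(variance_estimates) - mc_variance(totals)
    return mc_variance(variance_estimates) + bias**2


def relative_efficiency(v_j, v_1, totals):
    """(MSE_1 / MSE_j)^(1/2); values above 1 favour estimator j."""
    return sqrt(mse(v_1, totals) / mse(v_j, totals))


totals = [10.0, 12.0, 14.0, 12.0]   # toy estimates of the total from four draws
v1 = [3.0, 1.0, 3.0, 1.0]           # toy variance estimates, estimator 1
v2 = [2.5, 1.5, 2.5, 1.5]           # toy variance estimates, estimator 2
print(percent_bias(v1, totals), relative_efficiency(v2, v1, totals))
```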


In this study, we consider three labour force characteristics: Employed, Unemployed, and In Labour Force. The relative biases and efficiencies of the variance estimators are reported in Tables 2A and 3A respectively for the three characteristics. We observe that, with respect to bias, the variance estimators 1 and 2 are similar and so are 3 and 4. The variance estimators 1 and 2 have very large positive biases, notably for Employed and In Labour Force, while 3 and 4 have relatively small biases. In efficiency comparison, the variance estimators 3 and 4 are much superior to 1 and 2 and very similar to each other. Moreover, the variance estimator 2 also performed better than 1.

The four variance estimators were also evaluated for ratio estimates by total population at the level of aggregation of all the strata. The corresponding variance estimators denoted by $V_j^{(R)}$, $j = 1, 2, 3, 4$ were also obtained from each Monte Carlo draw by the Taylor linearization method. Then we obtained ratio adjusted version of percent biases of the four variance estimators (Table 2B) and relative efficiencies of the latter three variance estimators with respect to the first one (Table 3B).

We note that the biases of the variance estimators 1 and 2 were substantially reduced for ratio adjusted estimates, especially for Employed and In Labour Force. For the variance estimators 3 and 4, the biases were also reduced for Employed and In Labour Force but there was very little change for Unemployed. Although the biases of the four variance estimators are small, the only nonsignificant bias at 5% level was that of the variance estimator 3 for In Labour Force. All the observed differences between biases were significant at 5% level except those of the variance estimators 1 and 2 for the three characteristics.


Table 2A
Percent Biases of the Variance Estimators of the Estimates of LF
Characteristic Totals without Ratio Adjustment

                              Percent Bias
Characteristic       V1            V2            V3       V4

Employed             23.4          24.5          -4.7     -6.3
Unemployed            6.3           6.6           3.7     [illegible]
In Labour Force      24.2          [illegible]   -5.1     -6.7


Table 2B
Percent Biases of the Variance Estimators of the Estimates of LF
Characteristic Totals with Ratio Adjustment

                              Percent Bias
Characteristic       V1(R)         V2(R)         V3(R)    V4(R)

Employed             [illegible]    4.3          -1.1     -3.1
Unemployed            5.5          [illegible]    4.0      1.4
In Labour Force       4.5           5.0          -0.5     -2.5


Table 3A
Relative Efficiencies of V2, V3, and V4 with Respect to V1
(Rel. Eff. of Vj = [MSE(V1)/MSE(Vj)]^(1/2), j = 2, 3, 4)

                           Relative Efficiency
Characteristic       V2            V3            V4

Employed             [illegible]   [illegible]   3.11
Unemployed           [illegible]   [illegible]   1.76
In Labour Force      1.49          3.24          3.18


Table 3B
Relative Efficiencies of V2(R), V3(R), and V4(R) with Respect to V1(R)
(Rel. Eff. of Vj(R) = [MSE(V1(R))/MSE(Vj(R))]^(1/2), j = 2, 3, 4)

                           Relative Efficiency
Characteristic       V2(R)         V3(R)         V4(R)

Employed             [illegible]   2.59          2.52
Unemployed           [illegible]   [illegible]   1.76
In Labour Force      2.08          2.56          [illegible]



Table 4
Coverage Rates of 95% Confidence Intervals for the
Estimates of LF Characteristic Totals with Ratio Adjustment

                              Coverage Rate
Characteristic       V1(R)    V2(R)    V3(R)    V4(R)

Employed             93.6     95.4     94.6     94.2
Unemployed           94.3     95.1     95.3     [illegible]
In Labour Force      93.2     95.3     94.6     94.2


We also computed the 95% confidence intervals (CI’s) for the ratio-adjusted estimates from each Monte Carlo draw using the four variance estimators. The coverage rates were obtained as the proportion of CI’s which include the true value of characteristic total. The results are given in Table 4 and show that the performances of all the 4 variance estimators are very good for all the characteristics. Since the variance estimators of ratio-adjusted estimates provide confidence intervals which have coverage rates very close to the nominal value, the small biases of the variance estimators are of no practical consequence. Thus, from the bias point of view, all four variance estimators for the ratio-adjusted estimates are not much different from each other. The relative efficiencies of the variance estimators 3 and 4 are now only marginally better than 2 regardless of characteristic. The relative efficiencies of the 3 alternatives in this case are over 2 for Employed and In Labour Force. For Unemployed they are somewhat lower and lie between 1.5 and 1.8, which are almost the same as those for the unadjusted case. We should note here that the variance estimator 1 is computed with 19 degrees of freedom (1 per stratum). On the other hand, in the case of the 3 alternatives we have 81 degrees of freedom since each PSU is a replicate. Hence, we conclude that the stability of the Keyfitz variance estimator for the ratio-adjusted estimates is significantly improved by increasing the number of replicates and becomes comparable with the other two alternatives (see Table 3B).
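
The coverage rates reported in Table 4 are proportions of intervals containing the true total. A minimal sketch of that calculation follows, assuming normal-theory intervals of the form estimate ± 1.96 × (variance estimate)^(1/2); the multiplier, function name and toy inputs are assumptions of this illustration rather than details given in the text.

```python
def coverage_rate(estimates, variance_estimates, true_value, z=1.96):
    """Percentage of draws whose interval estimate +/- z * sqrt(variance) covers true_value."""
    covered = sum(
        1
        for est, var in zip(estimates, variance_estimates)
        if abs(est - true_value) <= z * var**0.5
    )
    return 100.0 * covered / len(estimates)


# Toy example: three draws, the third interval misses the true value 11.
print(coverage_rate([10.0, 12.0, 20.0], [1.0, 1.0, 1.0], 11.0))
```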


2.4 Keyfitz’s Variance Estimators with 2 vs. 6 Replicates for the LFS


The results of the Monte Carlo study reported in the previous sub-section have shown that the Keyfitz variance estimator compares well with the alternate methods for the variances of the ratio-adjusted estimates both from the bias and efficiency point of view when each method uses the same number of replicates. In addition, Keyfitz’s method has the advantage of simplicity, and estimating the variances of changes and averages under the alternative methods involves many complications. Therefore, the Keyfitz method was retained for the SR areas as well. In order to improve the efficiency of Keyfitz’s method, 6 rotation panels were adopted as replicates as opposed to 2 replicates in the old design. One major concern with using the rotation panels as replicates was whether there would be any serious inflation of the variance estimate due to panel bias.

This aspect was investigated for the three LF characteristics by computing the variance estimates using the variance formula developed in Section 3 with 2 and 6 replicates from the actual LFS data for 24 months (March ’85 - February ’87). From the 24 estimated variances for each of the LF characteristics, the means and standard deviations (SD’s) of the variances were obtained. The ratios of the means and SD’s of the variances under the two alternatives (2 vs. 6 replicates) are averaged over 24 Census Metropolitan Areas (CMA’s) and given in Table 5. The following observations can be made from the table:


Table 5
Comparison of SR Variance Estimates with 2 vs. 6 Replicates
per Stratum Based on CMA Data of the LFS
Mar ’85 - Feb ’87

                     Average Ratio of         Average Ratio of
Characteristic       Means of Variances       SD’s of Variances
                     (2 vs. 6)                (2 vs. 6)

Employed             0.997                    1.813
Unemployed           0.995                    1.515
In Labour Force      1.003                    1.833

Note: For each CMA, means and standard deviations of variance estimates were obtained from 24 months data for 2 and 6 replicates. Then the ratios (2 rep. vs. 6 rep.) of means of variances and of standard deviations (SD’s) of variances were calculated for each CMA. The average ratios in the table are the averages over 24 CMA’s.


(i) The effect on the levels of the variances due to using 6 replicates as compared to 2 is minimal, which means that adopting rotation panels as replicates has little impact on the bias of the variance estimates.

(ii) As expected, the variances are more stable with 6 replicates than with 2 and the results are not much different from those of the Monte Carlo study (see the first column in Table 3B).


From the above observations, we conclude that the efficiency of the Keyfitz method is improved substantially without having serious impact on the bias by adopting the 6 rotation panels as replicates as opposed to using only 2 replicates.


3. VARIANCE ESTIMATION FOR RAKING RATIO ESTIMATES 


3.1 Raking Ratio Estimation for the LFS 


In the old LFS, post-stratified ratio estimation was used. The subweight, which is the design weight adjusted for non-response, was ratio-adjusted to external estimates of the LFS target population for 38 post-strata defined by age and sex at provincial level. The LFS target population is the population 15 years of age and over excluding armed forces, inmates of institutions, and population living on Indian reserves.

This ratio estimation enhanced the quality of provincial data substantially but subprovincial data still had somewhat poor reliability. In order to improve subprovincial data, especially for Economic Regions (ER’s) and Census Metropolitan Areas (CMA’s), a raking ratio estimation procedure was adopted, through which simultaneous ratio adjustment at provincial and subprovincial levels is achieved.

The raking procedure is carried out in a sequence of adjustments: first, the subweight is adjusted to the subprovincial (CMA’s and Non-CMA parts of ER’s) population and then the provincial level adjustment by age/sex (the number of age/sex groups was reduced from 38 to 24 in the redesigned sample) is applied to the resulting weight. This procedure is repeated once more to obtain a second pair of weights. Note that for the ER’s containing CMA(s), the CMA part is excluded when defining adjustment cells for the ER’s so that the subprovincial adjustment cells are mutually exclusive. Let $W_0$ be the subweight and let $(W_1, W_2)$ and $(W_3, W_4)$ be the two pairs of weights resulting from the first and second iteration respectively. Labour force characteristics are estimated using $W_4$. Due to the order of adjustments, the marginal totals of $W_4$ at provincial age/sex groups are exactly the same as the external population estimates of the corresponding groups but the marginal totals of $W_4$ at subprovincial level (ER and CMA) are not quite equal to the corresponding external population estimates. However, the differences are very small.
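
The sequence of adjustments described above can be sketched as a two-margin raking loop. This is a simplified illustration with made-up records and control totals, not the LFS production procedure; the function and variable names are hypothetical, and the special area treatment is omitted.

```python
def adjust(weights, cells, controls):
    """Ratio-adjust weights so that weighted totals within each cell match the controls."""
    sums = {}
    for w, c in zip(weights, cells):
        sums[c] = sums.get(c, 0.0) + w
    return [w * controls[c] / sums[c] for w, c in zip(weights, cells)]


def rake(subweights, sub_cells, age_sex_cells, sub_controls, age_sex_controls, iterations=2):
    """Alternate the subprovincial and provincial age/sex adjustments, ending on age/sex."""
    w = list(subweights)
    for _ in range(iterations):
        w = adjust(w, sub_cells, sub_controls)          # W1, then W3
        w = adjust(w, age_sex_cells, age_sex_controls)  # W2, then W4
    return w


subweights = [10.0, 20.0, 30.0, 10.0]
sub_cells = ["A", "A", "B", "B"]
age_sex_cells = ["m", "f", "m", "f"]
final = rake(subweights, sub_cells, age_sex_cells,
             {"A": 50.0, "B": 30.0}, {"m": 45.0, "f": 35.0})
print(final)
```

Because the age/sex adjustment is applied last, the final weights reproduce the age/sex controls exactly, while the subprovincial totals are matched only approximately, as noted in the text.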

The special area frames, which are composed of military establishments, institutions, and remote areas, in general do not respect the ER and CMA boundaries and hence are treated differently during the raking procedure. Each special area type forms a stratum at the provincial level. The only exceptions are remote areas in the provinces of Quebec and Alberta where further stratification is carried out. Those ER’s and CMA’s which contribute to the special area frame will be called ‘‘contributing’’ ER’s and CMA’s. The special area records on the sample file are copied to each of the contributing ER’s or CMA’s with deflated subweights in proportion to the population of that particular type of special area in the contributing ER or CMA. The raking procedure is then carried out in the usual manner as described earlier.


3.2 Variance Formula for One-Iteration Raking Ratio Estimates 


The variance formula for one-iteration raking ratio estimates is derived here. The basic
methodology employed here is successive application of Taylor series approximation to the
raking ratio estimates until we obtain a linear form of subweights. Then the replication
formula is applied as in Woodruff (1971). The successive application of the Taylor series
approximation was also used by Arora and Brackstone (1977a,b) and Brackstone and Rao (1979)
to obtain variance formulae of raking ratio estimates for simple random sampling of units
or clusters. We have adopted this method for the stratified multi-stage PPS sampling design
following Woodruff’s approach.

Let $Y^{(0)}$, $Y^{(1)}$, $Y^{(2)}$ be the estimates of a labour force characteristic $y$ in a
province based on $W_0$, $W_1$, and $W_2$, respectively. The superscripts in parentheses
correspond to the subscripts of the $W$’s.

Then $Y^{(2)}$ can be expressed as follows:

$$Y^{(2)} = \sum_a \frac{Y_a^{(1)}}{P_a^{(1)}} P_a, \tag{3.1}$$

where $Y_a^{(1)}$ = $W_1$-weighted estimate of characteristic $y$ for the age/sex group $a$ in the
province,
$P_a^{(1)}$ = $W_1$-weighted estimate of population for the age/sex group $a$ in the province,
$P_a$ = External estimate of population for the age/sex group $a$ in the province.

Let $F_a = Y_a^{(1)}/P_a^{(1)}$. The first order Taylor approximation to $F_a$ at
$(E(Y_a^{(1)}), E(P_a^{(1)}))$ is

$$F_a \approx \frac{E(Y_a^{(1)})}{E(P_a^{(1)})} + \frac{1}{E(P_a^{(1)})} \{Y_a^{(1)} - E(Y_a^{(1)})\} - \frac{E(Y_a^{(1)})}{\{E(P_a^{(1)})\}^2} \{P_a^{(1)} - E(P_a^{(1)})\},$$

where $E$ denotes expectation.
Then a Taylor approximation to the variance of $Y^{(2)}$ can be written as

$$V(Y^{(2)}) \approx V\left[\sum_a \frac{P_a}{E(P_a^{(1)})} \{Y_a^{(1)} - R_a^{(1)} P_a^{(1)}\}\right], \tag{3.2}$$

where

$$R_a^{(1)} = \frac{E(Y_a^{(1)})}{E(P_a^{(1)})}.$$

Now the $W_1$-weighted estimates $Y_a^{(1)}$ and $P_a^{(1)}$ can be expressed in terms of
$W_0$-weighted estimates as follows:

$$Y_a^{(1)} = \sum_s \frac{Y_{as}^{(0)}}{P_s^{(0)}} P_s, \qquad P_a^{(1)} = \sum_s \frac{P_{as}^{(0)}}{P_s^{(0)}} P_s, \tag{3.3}$$

where $s$ denotes a CMA or an ER or the complementary part of an ER after removing the
CMA part and $P_s$ is the population of the subprovincial area $s$. Substituting the expressions
for $Y_a^{(1)}$ and $P_a^{(1)}$ from (3.3) into (3.2) and applying the first order Taylor
approximation to the ratios of $W_0$-weighted estimates, we obtain

$$V(Y^{(2)}) \approx V\left[\sum_a \sum_s \frac{P_a}{E(P_a^{(1)})} \frac{P_s}{E(P_s^{(0)})} \left\{ (Y_{as}^{(0)} - R_{as}^{Y(0)} P_s^{(0)}) - R_a^{(1)} (P_{as}^{(0)} - R_{as}^{P(0)} P_s^{(0)}) \right\}\right], \tag{3.4}$$

where

$$R_{as}^{Y(0)} = \frac{E(Y_{as}^{(0)})}{E(P_s^{(0)})} \quad \text{and} \quad R_{as}^{P(0)} = \frac{E(P_{as}^{(0)})}{E(P_s^{(0)})}.$$


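The expression in (3.4) can be written in terms of replicate level estimates. Define

$$Z_{ashi}^{Y} = \frac{P_a}{E(P_a^{(1)})} \frac{P_s}{E(P_s^{(0)})} \left( Y_{ashi}^{(0)} - R_{as}^{Y(0)} P_{shi}^{(0)} \right),$$

$$Z_{ashi}^{P} = \frac{P_a}{E(P_a^{(1)})} \frac{P_s}{E(P_s^{(0)})} \left( P_{ashi}^{(0)} - R_{as}^{P(0)} P_{shi}^{(0)} \right), \tag{3.5}$$

where $h$ denotes a stratum belonging to $s$ and $i$ denotes a replicate in $h$.
Then (3.4) can be rewritten by rearranging the order of summations as follows:

$$V(Y^{(2)}) \approx V\left[\sum_s \sum_{h \in s} \sum_i \sum_a (Z_{ashi}^{Y} - R_a^{(1)} Z_{ashi}^{P})\right] = V\left[\sum_s \sum_{h \in s} \sum_i D_{shi}^{(2)}\right], \tag{3.6}$$

where

$$D_{shi}^{(2)} = \sum_a (Z_{ashi}^{Y} - R_a^{(1)} Z_{ashi}^{P}).$$

Apart from special area strata, the $\sum_i D_{shi}^{(2)}$’s are independent because they are
based on subweights. However, for the special area strata they are highly correlated because
the same records are attributed to the contributing subprovincial areas.

We can rewrite (3.6) as

$$V(Y^{(2)}) \approx V\left[\sum_h \sum_{s \ni h} \sum_{i=1}^{n_h} D_{shi}^{(2)}\right] = V\left[\sum_h \sum_{i=1}^{n_h} \sum_{s \ni h} D_{shi}^{(2)}\right], \tag{3.7}$$

where $n_h$ is the number of replicates in stratum $h$ and $\sum_{s \ni h}$ is summation over all
the subprovincial areas containing the stratum $h$. For a non-special stratum, the stratum
appears in only one subprovincial area, and the summation $\sum_{s \ni h}$ is redundant. However,
a special area stratum could appear in several subprovincial areas and the summation
$\sum_{s \ni h}$ sums up all D-values ($D_{shi}^{(2)}$) belonging to the special area stratum.

Define

$$D_{hi}^{(2)} = \sum_{s \ni h} D_{shi}^{(2)}.$$

Then (3.7) becomes

$$V(Y^{(2)}) \approx V\left[\sum_h \sum_{i=1}^{n_h} D_{hi}^{(2)}\right]. \tag{3.8}$$

The variables $\sum_i D_{hi}^{(2)}$ are independent since they are based on subweights. Then,
ignoring the fpc, the variance can be estimated by

$$\hat{V}(Y^{(2)}) = \sum_h \frac{n_h}{n_h - 1} \sum_{i=1}^{n_h} \left( D_{hi}^{(2)} - \bar{D}_h^{(2)} \right)^2, \tag{3.9}$$

where

$$\bar{D}_h^{(2)} = \frac{1}{n_h} \sum_{i=1}^{n_h} D_{hi}^{(2)}.$$

In this expression, however, expected values are involved and these are unknown. The
variance can be approximated reasonably well by substituting expected values with their
estimates and hence, from (3.9), we obtain the final form of $\hat{V}$ as follows:

$$\hat{V}(Y^{(2)}) = \sum_h \frac{n_h}{n_h - 1} \sum_{i=1}^{n_h} \left( \hat{D}_{hi}^{(2)} - \bar{\hat{D}}_h^{(2)} \right)^2, \tag{3.10}$$

where

$$\hat{D}_{hi}^{(2)} = \sum_{s \ni h} \hat{D}_{shi}^{(2)}, \qquad \bar{\hat{D}}_h^{(2)} = \frac{1}{n_h} \sum_{i=1}^{n_h} \hat{D}_{hi}^{(2)},$$

$$\hat{D}_{shi}^{(2)} = \sum_a \left( \hat{Z}_{ashi}^{Y} - \hat{R}_a^{(1)} \hat{Z}_{ashi}^{P} \right),$$

$$\hat{Z}_{ashi}^{Y} = \frac{P_a}{P_a^{(1)}} \frac{P_s}{P_s^{(0)}} \left( Y_{ashi}^{(0)} - \frac{Y_{as}^{(0)}}{P_s^{(0)}} P_{shi}^{(0)} \right) = Y_{ashi}^{(2)} - \frac{Y_{as}^{(2)}}{P_s^{(0)}} P_{shi}^{(0)},$$

$$\hat{Z}_{ashi}^{P} = \frac{P_a}{P_a^{(1)}} \frac{P_s}{P_s^{(0)}} \left( P_{ashi}^{(0)} - \frac{P_{as}^{(0)}}{P_s^{(0)}} P_{shi}^{(0)} \right) = P_{ashi}^{(2)} - \frac{P_{as}^{(2)}}{P_s^{(0)}} P_{shi}^{(0)},$$

and

$$\hat{R}_a^{(1)} = \frac{Y_a^{(1)}}{P_a^{(1)}} = \frac{Y_a^{(2)}}{P_a}.$$

The formula (3.10) gives the variance for $W_2$-weighted estimates of LF characteristics and
requires two weights $W_0$ and $W_2$.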
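
The final replication step in (3.10) can be illustrated with a minimal Python sketch. It assumes the linearized replicate values have already been computed and summed over subprovincial areas; the function name and the dictionary layout are hypothetical, chosen only for illustration.

```python
def replication_variance(d_values):
    """Replication variance estimate from linearized replicate values.

    d_values: dict mapping a stratum label h to the list of D_hi values,
              one per replicate i in that stratum (fpc ignored).
    """
    total = 0.0
    for stratum, d in d_values.items():
        n_h = len(d)
        if n_h < 2:
            raise ValueError(f"stratum {stratum!r} needs at least two replicates")
        mean = sum(d) / n_h
        # n_h / (n_h - 1) times the sum of squared deviations within the stratum.
        total += n_h / (n_h - 1) * sum((x - mean) ** 2 for x in d)
    return total
```

For instance, `replication_variance({"h1": [1.0, 3.0], "h2": [2.0, 2.0, 5.0]})` adds a contribution of 4 from the first stratum and 9 from the second.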


3.3. Application of the One-Iteration Variance Formula to Two-Iteration Raking Ratio 
Estimates 


The variance formula for the two-iteration raking ratio estimates can be obtained by
successive application of the Taylor linearization as described in the previous section. However,
the formula thus obtained is very complex. It was conjectured that the variance formula for
one-iteration would be a reasonably good approximation for estimating the variance of the
two-iteration raking ratio estimates. The rationale behind this conjecture was that there were
only small perturbations in the weights after the first iteration. Now, the one-iteration variance
formula uses the pair of weights $(W_0, W_2)$. However, it was decided to use $(W_0, W_4)$
instead of $(W_0, W_2)$ since it was found that the use of $W_4$ instead of $W_2$ does not have
any impact on the CV’s of LF estimates, which are based on $W_4$. The one-iteration variance
formula using the pair of weights $(W_0, W_4)$ will be referred to as the one-iteration variance
estimator.

To verify our conjecture, a Monte Carlo simulation study was carried out using the 1981
Census data from the province of Nova Scotia. In each Monte Carlo sample, the LFS design
was simulated through all stages of sampling and a total of 1,000 Monte Carlo samples were
selected independently. For each Monte Carlo sample, the following statistics were calculated
for the three labour force characteristics at subprovincial and provincial levels:


1. Two-iteration raking ratio estimate, $Y^{(4)}$.


2. Variance estimate $\hat{V}(Y^{(4)})$ using the one-iteration variance estimator and the
corresponding estimate of CV.


3. 95% confidence interval (i.e., $Y^{(4)} \pm 1.96 \sqrt{\hat{V}(Y^{(4)})}$).


At the end of simulation, the average of 1,000 CV’s was computed and compared with 
the Monte Carlo CV which is very close to the true value. The results are given in Table 6A. 
In all 21 cases (3 characteristics for each of 7 areas) the differences are less than 8% and 
in 13 cases less than 4%. 

Also, the proportion of confidence intervals which cover the true characteristic value was 
obtained. The results are shown in Table 6B. Coverage rates for Employed and In Labour 
Force are very close to the nominal value in general, whereas those for Unemployed are 
somewhat lower but still acceptable. 

It was also found that the two-iteration raking ratio estimate is nearly unbiased with a 
maximum of 0.35 percent bias in all 21 cases. 
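
The summary measures used in this evaluation can be sketched as follows. This is an illustrative sketch, not the code used in the study: the function name is hypothetical, and the exact definition of the Monte Carlo CV (here, the standard deviation of the estimates over samples divided by their mean) is an assumption, since the text does not spell it out.

```python
import math


def summarize_simulation(estimates, variance_estimates, true_value):
    """Summarize Monte Carlo output for one characteristic in one area."""
    m = len(estimates)
    mean_est = sum(estimates) / m
    # Monte Carlo CV: spread of the estimates themselves across samples.
    mc_var = sum((y - mean_est) ** 2 for y in estimates) / (m - 1)
    mc_cv = 100 * math.sqrt(mc_var) / mean_est
    # Average of the per-sample CVs produced by the variance estimator.
    avg_cv = 100 * sum(math.sqrt(v) / y
                       for y, v in zip(estimates, variance_estimates)) / m
    # Share of intervals y +/- 1.96*sqrt(v) that contain the true value.
    covered = sum(abs(y - true_value) <= 1.96 * math.sqrt(v)
                  for y, v in zip(estimates, variance_estimates))
    return {
        "avg_cv": avg_cv,
        "mc_cv": mc_cv,
        "coverage": 100 * covered / m,
        "relative_bias": 100 * (mean_est - true_value) / true_value,
    }
```

Comparing `avg_cv` with `mc_cv` corresponds to Table 6A, and `coverage` corresponds to Table 6B.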


Table 6A

Average CV’s Obtained by the
One-Iteration Variance Estimator and the Monte Carlo CV’s

Characteristic     ER 210        ER 220        ER 230        ER 240        ER 250        CMA Halifax   Province (Nova Scotia)

Average CV’s
Employed           3.52          3.46          3.14          3.05          1.96          2.01          1.08
Unemployed         10.36         12.28         [illegible]   13.43         10.35         10.55         [illegible]
In Labour Force    2.98          [illegible]   2.85          [illegible]   [illegible]   1.83          0.91

Monte Carlo CV’s
Employed           3.48          3.35          2.95          2.86          1.97          1.99          [illegible]
Unemployed         10.90         12.71         13.28         13.37         [illegible]   [illegible]   5.59
In Labour Force    2.76          3.08          2.76          [illegible]   [illegible]   1.74          0.92


Table 6B

Coverage Rates of 95% Confidence Intervals
Constructed by the One-Iteration Variance Estimator

Characteristic     ER 210        ER 220        ER 230        ER 240        ER 250        CMA Halifax   Province (Nova Scotia)

Employed           [illegible]   [illegible]   [illegible]   [illegible]   94.7          94.9          92.5
Unemployed         92.1          90.7          91.4          [illegible]   [illegible]   [illegible]   93.4
In Labour Force    96.2          [illegible]   93.6          [illegible]   [illegible]   96.0          94.0


4. CONCLUSIONS


It has been shown that the Keyfitz variance estimation method for estimates without ratio 
adjustment (in this case it becomes just a replication method) has very large positive biases 
and low efficiencies while the alternatives have negligible biases and higher efficiencies for 
the labour force characteristics considered in this study. 

However, for the ratio-adjusted estimates, all the methods considered here have negligibly 
small biases. It has also been shown that the efficiency of the Keyfitz method can be improv- 
ed substantially and made comparable to the alternatives by increasing the number of 
replicates. It was demonstrated using actual LFS data that using 6 rotation panels as replicates 
in the Keyfitz variance estimator as opposed to 2 pseudo replicates does not introduce bias 
due to the phenomenon of rotation panel bias. As shown by Monte Carlo results, the one- 
iteration variance formula derived by the Keyfitz method using Taylor linearization gives 
reasonably good variance estimates for the two-iteration raking ratio estimates and has good 
coverage properties.


The "AGEVEN" Record: A Tool for the Collection
of Retrospective Data 


PHILIPPE ANTOINE, XAVIER BRY and PAP DEMBA DIOUF¹


ABSTRACT 


Because it is easy to use, the "AGEVEN" record makes it possible to date events more precisely and
to classify retrospectively demographic events (births and deaths), changes in marital status and changes 
in place of residence. The data collected are used to accurately recreate the socio-economic conditions 
that were present when the demographic events being studied took place. 


KEY WORDS: Retrospective survey; Biographies; Demographic survey. 


1. INTRODUCTION 


Two major data collection methods are available to demographers to collect data on natural
movement (natality and mortality): longitudinal observation and retrospective questionnaires.
The longitudinal observation method (following a population sample over a relatively long
period of time) is, in theory, the method which provides the most accurate results. It does,
however, have its drawbacks. It is expensive because of the amount of travel required for
observation, and a relatively lengthy period of time is needed to obtain results. Finally, in
urban areas, the method is difficult to apply because of the high degree of mobility of the
population, which leads to a significant deterioration of the sample, such as that encountered
in IFORD’s infant and child mortality surveys (Scott 1985; Fargues 1985).

The retrospective method gives less reliable results because it depends more on the memory
of the respondents. However, the total observation period is generally longer than that of
the longitudinal surveys introduced in recent years in African countries. The risk of omitting
events remains high and dating them is inaccurate. Finally, in urban areas there is a tendency
when reconstituting the past to mix events which took place in the city being surveyed
with other, earlier events, which took place in other places of residence (urban or rural).

Since we wished to determine mortality and fertility differences in Pikine, a suburb of
Dakar, and also wished to obtain fairly reliable results quickly, we selected a data collection
method that would enable us to recreate accurately the infant and child mortality risk factors
at the time of death of each of the children of the women surveyed. The survey was
conducted jointly by the Senegal Statistics Branch and Orstom (Antoine and Diouf 1986). The
field work was carried out between March and May 1986. The first results were available
in September 1986. The method we selected is different from the retrospective method most
frequently used, which takes into account only the socio-economic and cultural characteristics
of the women at the time of the survey. These characteristics could, in fact, have changed
considerably during the women’s child-bearing years (improvement or deterioration of living
conditions, change of marital status, change of activity, and so forth). Our method makes
it possible to better assess the relationship between urban insertion and changes in demographic
behaviour. The following objectives determined our collection strategy:

- to obtain a complete list of the events observed (mainly births and deaths);
- to date these events as accurately as possible;
- to place the events in their socio-economic context (marital status, professional status of
the husband and wife, living conditions).


¹ Philippe Antoine, demographer, and Xavier Bry, statistician, ORSTOM, P.O. Box 1386, Dakar, Senegal; Pap
Demba Diouf, demographer, Statistics Branch, P.O. Box 116, Dakar, Senegal.


2. COLLECTION AND DATING OF DEMOGRAPHIC EVENTS 


To conduct a successful retrospective survey means, in particular, establishing as accurate 
a biography as possible (in relation to the field studied) for each person surveyed. A method 
has to be found, therefore, to situate past events chronologically. 

A number of methodological improvements have been proposed in the past. Ferry (1977) 
used an ‘‘event file’’, which involved assigning a record to each event. According to the author, 
the originality of this method lay in placing the events in order together with the person 
surveyed (pregnancies, marriages and divorces, places of residence and so forth) and situating 
them in relationship with each other. The technique consisted in recreating, with the person 
surveyed, the succession, logic, interferences and, finally, the individual biography. However, 
it is a relatively complex method and involves handling numerous records in the field and
during processing.

Another method of classifying and dating events was used in the Senegalese survey on
fertility in 1978: the "AGEVEN" graph. There were two reasons for using the "AGEVEN"
graph in the Senegalese survey:

- to make it possible to better estimate the age of the women and their children with the
help of relatively precise dating;

- to make it possible to accurately estimate fertility by preparing the pregnancy histories
of all the women.


The "AGEVEN" graph used in the Senegalese fertility survey (Figure 1) plots two curves.
The righthand curve describing the lifeline of the woman (LL curve) is graduated in intervals
of three months, making it possible to plot inside a year the events affecting the woman.
The lefthand curve, called the AE (age of events) curve, indicates the time which has passed
between the event and the date of the survey. Thus, an age on the AE curve corresponds
to each year on the LL curve, and vice versa. This graph, which was also used in the Ivory
Coast fertility survey, seems to be mainly an instrument for dating events.
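
The correspondence between the two curves amounts to a simple subtraction from the survey date. The sketch below works in whole years for brevity, whereas the graph itself is graduated in quarters; the function names are hypothetical.

```python
def age_of_event(event_year, survey_year):
    """Years elapsed between an event and the survey (reading the AE curve)."""
    if event_year > survey_year:
        raise ValueError("event cannot follow the survey")
    return survey_year - event_year


def year_of_event(age, survey_year):
    """Inverse reading: the calendar year on the LL curve for a given age."""
    return survey_year - age
```

With a 1986 survey, a child said to be ten years old is placed in 1976 on the lifeline, and an event dated 1971 is read as having occurred fifteen years ago.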


3. USE OF THE "AGEVEN" RECORD IN THE PIKINE SURVEY


We tried to combine some of the advantages of each of these collection methods: the
"AGEVEN" graph, which is easy to use to date events, and the event file, which makes
it possible to take various kinds of events and to classify them in relation to each other. We
systematized the "AGEVEN" record by distinguishing between demographic events (births,
deaths), changes in marital status and changes in place of residence. For convenience, we
retained the name given the graph used in the Senegalese fertility survey for our record, but
while the name is the same, the uses which can be made of it are different. The "AGEVEN"
record (see Figure 2) contains three columns:

- the first covers demographic events (births (B); deaths (DT); abortions (A); miscarriages
(MC); stillbirths (SB)). Each event (birth or death) must be followed by its chronological
ranking, the first and last names of the child and, possibly, the exact date;

- the second column covers matrimonial events and the chronological ranking of each of
the spouses or partners (marriages (M); divorces (D); widowhood (W)), and the rank of the
various fathers (indicated as F1, F2, ... Fn);

- the third column indicates the place of residence at the time of each of these demographic
and matrimonial events. This column makes it possible to follow the migratory paths of
the women and to determine the date of their arrival in Pikine.

Figure 2. Example of use of the "AGEVEN" record.

The "AGEVEN" record is a methodological tool that serves various purposes:
- situating events chronologically;
- helping the woman situate chronologically events for which she has forgotten the date;
- ensuring that all the demographic events lived by the woman surveyed are recorded;
- identifying changes of residence and the location where events took place;
- checking the consistency of events among themselves.

The interview consists of two phases: one involving the household and the other involving
the women between the ages of 15 and 49. The "household" questionnaire, which lists
all members of the household, whether currently residing in the household or not, deals in
particular with the filiation of the persons surveyed, their blood relationship with the head
of the household or "nucleus," their sex, their marital status, and their date of birth or age.
The "women’s" questionnaire concerns all the women, resident and present in the household,
between the ages of 15 and 49. The "AGEVEN" record is used to complete this questionnaire.

To transcribe the data collected on this record, the investigator can take various points
of reference (the date of birth of the woman, the date of birth of her first child, and so forth)
and, with the help of the respondent, reconstitute her entire lifeline, namely all the other
events which have taken place during her life, such as marriage, divorce, and various
pregnancies. This operation may be broken down as follows:

1. After recording the first live birth, the investigator asks the respondent to state all
subsequent live births, in chronological order, indicating whether or not the child is still alive
and whether or not he or she is still living in the household.

2. The investigator then records these births on the record, using the official documents shown
to him. In our case, official documents were available mainly for children born in the
Dakar area. For the age of the women, however, as well as for the birthdates of some
children, the investigator has to rely on elements in the historical calendar to determine
the dates (month and year).


The "AGEVEN" record makes it possible to situate events according to the age of the
woman at the time of the event, the time which has passed since the event took place, or
the date of the event. Any large gap between two births or other inconsistency between two
events is easily detected during the interview with the woman.

It is also possible to use the "AGEVEN" record to check the consistency of events. For
example, two children cannot be born within nine months of each other; a woman cannot
say that she was married at age 12 and had her first child in 1970 at age 14, and then go
on to say that she was born in 1950. In the latter case, there is likely an error in the date
of birth of the woman and it should be corrected.
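
The two kinds of checks mentioned here can be expressed as a short routine. This is an illustrative sketch only: the function name, the argument layout and the one-year tolerance on reported ages are assumptions made for the example, not part of the survey procedure.

```python
from datetime import date


def check_record(woman_birth_year, births, dated_events):
    """Return a list of human-readable inconsistencies.

    births: list of date objects for live births.
    dated_events: list of (label, year, reported_age) tuples for events
                  reported with both a calendar year and the woman's age.
    """
    problems = []
    ordered = sorted(births)
    for earlier, later in zip(ordered, ordered[1:]):
        months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
        # Births in the same month are treated as twins and not flagged.
        if 0 < months < 9:
            problems.append(
                f"births {earlier} and {later} are only {months} months apart")
    for label, year, age in dated_events:
        implied = year - woman_birth_year
        # Allow one year of slack because birthdays fall within the year.
        if abs(implied - age) > 1:
            problems.append(
                f"{label}: reported age {age}, but {year} implies about {implied}")
    return problems
```

For the case in the text, `check_record(1950, [date(1970, 3, 1)], [("first child", 1970, 14)])` reports that a first child in 1970 implies an age of about 20 rather than 14, which points to an error in the woman’s date of birth. The birth date used in this call is invented for the example.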

The record makes it possible to record both events for which an exact date is given and
events for which only an age is given (such and such a child is now ten years old; I was married
15 years ago). Finally, with the help of this record, events for which the date is not clear
can be situated. For example, such and such a child was born between the one born on 10-2-74
and the one born in 1978. It is highly likely that this child was born in 1976. To use this
record successfully, the investigator must take a critical look at the chain of events and must
try to make it as complete as possible, taking care to check the reliability and consistency
of the responses provided. This is possible only if confidence is established in the dialogue
with the respondent.

After having recorded all the live births declared by the respondent, the investigator turns
to the intervals between successive births. All events are not always reported in the initial
responses, but by using the "AGEVEN" record, the investigator can track down the
omitted events. The investigator thus asks himself what happened each time an interval of
more than two years is recorded between two live births. The responses provided by the
respondent may reveal abortions, stillbirths, death soon after birth, information obtained on
contraceptives, and so forth. Although this was not an objective of the Pikine survey, the dialogue
that is established can make it possible to delve deeper into matters relating to family planning.

Each of the events is linked to the location, marital status and partner of the woman at
the time of the event. After recording all the events affecting the woman, the investigator
then has to estimate more accurately the date of birth of the mother. The investigator has
in fact already recorded the date of birth of the mother, as indicated either by the woman
or the head of the household, when completing the "household" questionnaire. Now, in
a one-on-one interview with the respondent and having recorded the events which affected
her, he can provide the best possible estimate of the respondent’s age.

For example, Awa was born in 1956 in Kaolack. She says that she has had three children:
Ibrahima, who would now be 10 years old, born in Dakar, died at age 4 in Pikine; Abdoul,
born on January 5, 1978 in Dakar; and Aminata, born on December 18, 1984 in Pikine.
Awa was married for the first time at age 17 in Thies. She was divorced in 1979 (while living
in Pikine). She remarried in 1982, at which time she was still living in Pikine (see Figure
2). During the interview, the investigator will notice a gap of almost 7 years between Abdoul
and Aminata. He should ask whether there were other births or pregnancies during this period.
In the case of Awa, the divorce and subsequent remarriage three years later may explain
the gap. However, the investigator must check with the woman to ensure that the gap does
not hide other demographic events.
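
The probing rule applied in Awa’s case (ask about any interval of more than two years between live births) can be sketched as follows. The function name and threshold parameter are hypothetical, and because Ibrahima’s exact birth date is not given, a mid-1976 date is assumed purely for illustration.

```python
from datetime import date


def intervals_to_probe(birth_dates, threshold_years=2.0):
    """Return (earlier, later, gap_in_years) for gaps above the threshold."""
    ordered = sorted(birth_dates)
    flagged = []
    for earlier, later in zip(ordered, ordered[1:]):
        gap = (later - earlier).days / 365.25
        if gap > threshold_years:
            flagged.append((earlier, later, round(gap, 1)))
    return flagged


# Awa's children; Ibrahima's date is an assumed placeholder (mid-1976).
awa_births = [date(1976, 6, 30), date(1978, 1, 5), date(1984, 12, 18)]
for earlier, later, gap in intervals_to_probe(awa_births):
    print(f"ask about the period {earlier} to {later} ({gap} years)")
```

Under these assumptions only the interval between Abdoul and Aminata is flagged, which is the gap the investigator is expected to question.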

The interactive form of the interview seems to encourage dialogue with the respondent
and improves contact between the investigator and respondent, which is unfortunately only
too often clouded by doubt on the part of the investigator and mistrust on the part of the
respondent (Bonnet 1984). As the investigator continues his or her investigation, new events
are mentioned. When he or she asks whether there was another event between two births
separated by more than two years, the respondent is often surprised and responds in one
of two ways. If no event has occurred, she asks, "Why do you ask that?" If, however, an
event has indeed occurred, she often asks, "Who told you that?" since she has the impression
that the investigator already knows something. The "AGEVEN" record becomes a kind
of crystal ball, like the cowry shell. Sometimes the interview becomes a game, and the
respondent is pleased to place past events in order. A woman with a complicated marital and
reproductive history may even want a copy of her "AGEVEN" record. As in any survey,
there are problems with the use of this record. Sometimes it is difficult or awkward to be
alone with the respondent, and often women are embarrassed if the record brings up events
concerning a partner preceding the current husband.

In practice, the record is incomplete because there is no question which eliminates possible
confusion between stillbirths and infants who die shortly after birth. This kind of confusion
often arises in responses given in the Wolof language, in which it is difficult to distinguish
between miscarriages and abortions and between stillbirths and deaths immediately after birth.
Some French terms or words cannot be translated directly into Wolof. For stillbirths, for
example, there is no single question that elicits the desired response. At least two questions
are therefore required. When confronted with an interval between successive births, the
investigator asks the following question, for example: "Lou am dikhane te Moussa ak Ali?"
(what happened between Moussa and Ali?). This question correctly leads the women to
stillbirths, abortions, miscarriages and so forth. To elicit a satisfactory response, clarifications
are needed: "Dikhane té Moussa ak Ali, amo fi dom diou dé guinaw bou mou ing
bakhane?" (did you have a child who died after giving some sign of life between Moussa
and Ali?). The confusion results mainly from the fact that the distinction between a miscarriage
and stillbirth is not always clear and from the fact that a child is not given a name
until he or she is a week old. Also, for certain ethnic groups, it is not until the child has
a name that he or she is really taken into account. A column indicating whether or not the
infant cried at birth would therefore have been very useful.

The "AGEVEN" record used in the Pikine survey did indeed provide more satisfactory
data than the graph used in the Senegalese fertility survey, in terms of both the nature and
quantity of data collected. However, it did not eliminate the tendency to round off the
intervals between successive births in years (approximately 37% of the intervals), particularly in
intervals of two years, which account for approximately 20% of the intervals observed
between successive births. In addition, it was not possible, using this technique, to list all the
issue of young girls who had been pregnant but who had had no live births. Some biases,
which are certainly classic in demography, do persist therefore, and this method does not
eliminate the need to take extreme care in the field.


4. TRANSCRIPTION FROM THE "AGEVEN" RECORD TO THE QUESTIONNAIRE
AND ELECTRONIC DATA PROCESSING


The questionnaire regarding the reproductive history of the women was designed in such
a way as to permit the best possible transcription of the data collected using the "AGEVEN"
record. First, the characteristics of each of the children are noted in chronological order by
birth, along with the date of death, if appropriate. The investigator then records the marital
status at the time of each of these events in order to note any possible change in spouse.
Then, changes in the socio-economic situation of the father and mother are taken into
account, as well as changes in living conditions and in place of residence. The survey also
included other questionnaires regarding the characteristics of the household, individuals and
women observed.

The data collection method allows for two kinds of analysis. The first involves a classical
analysis of mortality by generation and sub-population (according to neighbourhood, type
of housing and so forth). However, what is especially interesting about this study is that
it allows for analysis of mortality (and fertility) taking into account migratory behaviour
and changes in the socio-economic conditions of the women surveyed. When this method
is used, mortality is no longer interpreted solely according to the socio-economic conditions
at the time of the survey. Rather, it is related to the conditions which really existed at the
time of the event, and it is therefore possible to better understand the differences relating
specifically to living conditions in urban areas (Pikine in this case).

Depending on the place of birth of the child, different mortality rates were recorded. Many
of the respondents are migrant women from other cities or from villages in the interior of
the country. Children born to them in rural areas suffered a significantly higher risk of
mortality than those born in the Dakar area.

The child mortality rate (between 1 and 4 years) clearly reveals the risks resulting from
socio-economic differences. The risk of dying between the ages of 1 and 4 is 2.84 times higher
for children born in villages than for those born in Pikine. The z-test shows that the
difference between the two rates (Pikine mortality rate and rural mortality rate) is significant.
We tested the hypothesis that the mortality rate for children born in Pikine is the same as
that for children born in rural areas. Since the sample sizes are relatively large, approximation
using the normal distribution is justified. Under the hypothesis that the mortality rates
are equal, the z-statistic is distributed as a standard normal variable. The symbol "**"
indicates a significant difference at the α = 0.05 level. Classic retrospective data collection
without distinction as to the place of birth of the child would have led us to class births
outside Pikine with those inside Pikine and would have resulted in a higher mortality rate (child
mortality rate of 68 per thousand rather than 55 per thousand).

Table 1
Mortality by place of birth (per thousand)

              Pikine    Dakar    Other Cities    Rural    Total    Pkn-Rural Test
Infants           52       57              45      114       58          -6.586**
Children          55       62              90      156       68         -10.093**
Population      5155     1513             644      704     8016
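
The test described above can be illustrated with a pooled two-proportion z-statistic. This is a sketch under assumptions: the text does not give the exact formula or denominators used, so the population row of Table 1 is taken as the number of children at risk in each group, and the function name is hypothetical.

```python
import math


def two_proportion_z(rate1, n1, rate2, n2):
    """Pooled two-sample z-statistic for the difference of two proportions."""
    pooled = (rate1 * n1 + rate2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (rate1 - rate2) / se


# Rates per thousand from Table 1, converted to proportions (Pikine vs. rural).
z_infant = two_proportion_z(0.052, 5155, 0.114, 704)
z_child = two_proportion_z(0.055, 5155, 0.156, 704)
print(round(z_infant, 2), round(z_child, 2))
```

With these inputs the statistics come out near, though not identical to, the published values of -6.586 and -10.093; the small differences are to be expected given the rounded rates and the assumed denominators. Both are far beyond the critical value of 1.96 for a two-sided test at the 0.05 level.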

Moreover, a second analysis can be made for each of the women observed. A simplified
biographical file can be created in which the successive stages are defined in terms of births.
A relationship is thus established between matrimonial events, changes in residence and
reproductive data. The principal stages in the migratory path followed since the birth of the
first child, or since marriage, can also be reconstructed. Longitudinal data gathered in this
way lend themselves very well to recent methods for the analysis of interference between
phenomena (Courgeau and Lelièvre 1986; Cox and Oakes 1984).


5. CONCLUSION 


The data collected for each of the variables are very brief, but they should make it
possible to detect some significant differences and to determine the living conditions at the time
of birth and death. The collection methodology used is adapted to the collection of data
on the reproductive histories of the women and the destiny of their children. The main
advantage of the "AGEVEN" record is its facility in pinpointing various events chronologically
and in classifying these events in relationship with each other, without eliminating the
possibility of inserting events omitted as the interview proceeds. The flexibility of the "AGEVEN"
record leads us to suggest that it could be used in other fields, for professional biographies
or migratory routes, for example, by establishing a parallel between place of residence,
profession, marital status, family situation, living conditions and so forth. A great deal of
methodological research has been conducted in the analysis of demographic biographies
(Courgeau 1984; Haeringer 1972; Riandey 1985). Our method is intended merely as a simple
and reliable tool for the collection of data. It is up to each user to determine which variables
he or she wishes to arrange chronologically using the "AGEVEN" record and, once the
biographical framework has been collected, to obtain more data on the field(s) he or she
is studying, using the questionnaire.


ACKNOWLEDGMENTS 


The authors would like to thank the referees for their helpful comments.




REFERENCES 


ANTOINE, Ph., and DIOUF, P.D. (1986). Changements démographiques en milieu urbain. Paper
presented at Séminaire sur la mortalité au Sénégal. Dakar.

BONNET, D. (1984). Occultation, omissions. Quelques problèmes soulevés par l’enquête quantitative
en matière de santé. Medicus Mundi, 11.

COURGEAU, D. (1984). Relations entre cycle de vie et migrations. Population, 39, 483-513.

COURGEAU, D., and LELIEVRE, E. (1986). Nuptialité et agriculture. Population, 41, 303-326.

COX, D.R., and OAKES, D. (1984). Analysis of Survival Data. London: Chapman and Hall.

DIRECTION DE LA STATISTIQUE (1981). Enquête Sénégalaise sur la Fécondité, 1978 - Rapport
National d’Analyse, 1.

FARGUES, Ph. (1985). L’évaluation du niveau de la mortalité à partir des données des enquêtes EMIJ.
Les enquêtes sur la mortalité infantile et juvénile (EMIJ), 1, 60-84.

FERRY, B. (1977). Le fichier événement. Une nouvelle méthode d’observation rétrospective. In
L’Observation démographique dans les pays à statistiques déficientes. Liège, Belgium: Ordina
Editions, 137-150.

HAERINGER, Ph. (1972). Méthodes de recherche sur les migrations africaines. Un modèle d’interview
biographique et sa transcription synoptique. Cahiers ORSTOM, 9, 439-453.

RIANDEY, B. (1985). L’enquête ‘‘biographie familiale professionnelle et migratoire’’ (INED, 1981).
Le bilan de la collecte. In Migrations internes, collecte des données et méthode d’analyse.
Département de démographie, Université de Louvain, 117-134.

SCOTT, Ch. (1985). Les problèmes de déperdition dans les enquêtes suivies. In Les enquêtes sur la
mortalité infantile et juvénile (EMIJ), 1, 44-47.




Survey Methodology, December 1987 173
Vol. 13, No. 2, pp. 173-181 
Statistics Canada 


An Alternative Method of Controlling 
Current Population Survey Estimates 
to Population Counts 


K.R. COPELAND, F.K. PEITZMEIER, and C.E. HOY¹


ABSTRACT 


The CPS uses raking ratio estimation in post-stratification estimation to adjust sample estimates of 
population to census-based estimates of the population. An alternative procedure, using generalized 
least squares, is compared to the current procedure. 


KEY WORDS: Generalized least squares; Post-stratification; Raking ratio estimation. 


1. INTRODUCTION 


The Current Population Survey (CPS) produces labor force estimates for the total U.S. 
working-age civilian noninstitutional population, based on a monthly multi-stage probabili- 
ty sample of approximately 60,000 housing units in the U.S. Each month a rotating sample 
comprised of 8 panels (called rotation groups) of housing units is interviewed, with 
demographic and labor force data being collected for all civilian adult occupants of the sam- 
ple housing units. 

Monthly estimates are published, subaggregated by demographic characteristics. Estimates 
for other subaggregates of the population (states, families, veterans, wage and salary earners, 
persons not in the labor force, etc.) are also produced on a monthly, quarterly, and/or an- 
nual basis. 

Sample person weights are derived through the application of probability of selection, 
adjustment for nonresponse, and ratio adjustment to reduce the contribution to the variance 
due to the sampling of primary sampling units. A post-stratification estimation procedure 
adjusts the sample person weights so as to control the survey estimates of population to in- 
dependently derived estimates of the population. The resultant weights are used in a com- 
posite estimation procedure and then seasonally adjusted to produce national estimates 
(Hanson 1978). 

Detailed estimates for certain population subdomains (families, wage and salary earners, 
persons not in the labor force, family earnings, and veterans) make use of sample weights 
derived from adjustment procedures built on top of the post-stratification estimation. 

A generalized least squares (GLS) approach could potentially be used in place of
post-stratification estimation or to integrate the various CPS adjustment procedures. GLS
has been proposed and investigated for use in the Consumer Expenditure Survey
(Zieschang 1986).

This article discusses and compares the current CPS post-stratification estimation (which 
uses raking ratio estimation) and the GLS procedure, based on two months’ CPS data (July 
1983 and July 1984). Both macro and micro level data were examined to evaluate differences, 
if any, in the two procedures in this application. 


¹ K.R. Copeland, F.K. Peitzmeier, and C.E. Hoy, Division of Statistical Methods, Office of Employment and
Unemployment Statistics, Bureau of Labor Statistics, Washington, D.C. 20212 U.S.A.


174 Copeland, Peitzmeier and Hoy: Alternative Weighting for CPS 


2. CURRENT CPS POST-STRATIFICATION ESTIMATION 


The CPS post-stratification estimation uses raking ratio estimation (RRE) to adjust the
sample weights within a rotation group so as to control the sample estimates for the
population to independently derived estimates of the population in each of three categories
(state, age/sex/ethnicity, age/sex/race).

The methodology for RRE was first proposed by Deming and Stephan (1940) as an iterative
alternative to least squares adjustment of table data. The RRE procedure has been shown to
produce best asymptotically normal (BAN) estimates under simple random sampling, and to
minimize the adjustments made to the sample weights based on one measure of closeness, as
discussed in subsection 4.2 (Ireland and Kullback 1968). In addition, RRE, although producing
biased estimates, can sometimes be effective in reducing the mean square error of survey
estimates. This is believed to be the case in the application of RRE for CPS (Hanson 1978).

For the CPS, the RRE procedure attempts to adjust the sample counts $\{n_{ijk}\}$ obtained
from previous stages of weighting to adjusted sample counts $\{\tilde{n}_{ijk}\}$ under the condition that

    (A)  $\sum_{j,k} \tilde{n}_{ijk} = m_{i..}$

    (B)  $\sum_{i,k} \tilde{n}_{ijk} = m_{.j.}$

    (C)  $\sum_{i,j} \tilde{n}_{ijk} = m_{..k}$

be satisfied simultaneously,
where  $i$ = state ($i = 1, \ldots, 51$),
       $j$ = age/sex/ethnicity ($j = 1, \ldots, 16$),
       $k$ = age/sex/race ($k = 1, \ldots, 70$),
       $m_{i..}$ = independent state estimate,
       $m_{.j.}$ = independent age/sex/ethnicity estimate,
       $m_{..k}$ = independent age/sex/race estimate.


The RRE procedure proportionately ratio adjusts the sample data each way (i.e., state,
age/sex/ethnicity, and age/sex/race) of the table in successive steps, as follows.


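(1) Ratio adjustment by state:

    $n_{ijk}^{(1,1)} = n_{ijk} \, m_{i..} / n_{i..} = a_i^{(1)} n_{ijk}$

(2) Ratio adjustment by age/sex/ethnicity:

    $n_{ijk}^{(1,2)} = n_{ijk}^{(1,1)} \, m_{.j.} / n_{.j.}^{(1,1)} = b_j^{(1)} n_{ijk}^{(1,1)}$

(3) Ratio adjustment by age/sex/race:

    $n_{ijk}^{(1,3)} = n_{ijk}^{(1,2)} \, m_{..k} / n_{..k}^{(1,2)} = c_k^{(1)} n_{ijk}^{(1,2)}$

where  $n_{i..}$ = sample row total,
       $n_{.j.}$ = sample column total,
       $n_{..k}$ = sample layer total.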

The completion of the three adjustment steps constitutes one iteration of the raking
process. The three steps are repeated substituting the current value of $n_{ijk}^{(h,3)}$ (adjusted sample
count following the third way rake of the h-th iteration) for $n_{ijk}$ in step (1) each time until
6 iterations are completed. (The number of iterations used in CPS was determined based
on the convergence properties of the RRE for CPS and the relative gains achieved by number
of iterations.) The final $\{n_{ijk}^{(6,3)}\}$ is taken as $\{\tilde{n}_{ijk}\}$.

In order to adjust the sample weights, the adjustment factor for sample records in cell
{ijk} is

    $F_{ijk} = n_{ijk}^{(6,3)} / n_{ijk} = \prod_{h=1}^{6} a_i^{(h)} b_j^{(h)} c_k^{(h)}.$

The sample weights prior to RRE are multiplied by the appropriate $F_{ijk}$ to obtain the
adjusted weights.
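
The three-way raking cycle described in this section can be sketched in a few lines of Python with NumPy. This is only an illustrative sketch, not the CPS production code: the function name `rake`, the tiny 2 x 2 x 2 table, and the control totals are all hypothetical, and every cell and marginal total is assumed to be positive.

```python
import numpy as np

def rake(counts, m_state, m_eth, m_race, iterations=6):
    """Three-way raking of a (state x ethnicity-cell x race-cell) table.

    Returns the adjusted counts and the per-cell adjustment factors.
    Assumes every cell count and every marginal total is positive.
    """
    start = np.asarray(counts, dtype=float)
    n = start.copy()
    for _ in range(iterations):
        # Step (1): ratio adjust each state slice to its control total.
        n *= (m_state / n.sum(axis=(1, 2)))[:, None, None]
        # Step (2): ratio adjust each age/sex/ethnicity slice.
        n *= (m_eth / n.sum(axis=(0, 2)))[None, :, None]
        # Step (3): ratio adjust each age/sex/race slice.
        n *= (m_race / n.sum(axis=(0, 1)))[None, None, :]
    factors = n / start  # cell-level factor applied to the prior weights
    return n, factors

# Hypothetical toy data: the three sets of controls share the same grand total.
counts = np.arange(1.0, 9.0).reshape(2, 2, 2)
m_state = np.array([20.0, 30.0])
m_eth = np.array([22.0, 28.0])
m_race = np.array([24.0, 26.0])
adjusted, factors = rake(counts, m_state, m_eth, m_race)
```

Because the age/sex/race step is applied last in each cycle, that margin is matched exactly after any number of iterations, while the other two margins are matched only approximately until the procedure converges.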


3. APPLICATION OF THE GLS IN THE CPS 


The generalized least squares (GLS) procedure adjusts the sample weights from prior stages 
of weighting by minimizing the weighted squared adjustments, subject to a set of linear ‘con- 
trol’ constraints the adjusted weights must satisfy. This is the problem which Deming and 
Stephan attempted to address in developing the RRE. The GLS procedure, like RRE, pro- 
duces BAN estimates under certain conditions, in this case when all the cells are nonempty 
(Neyman 1949). GLS, by definition, minimizes the adjustments to the sample weights based 
on one measure of closeness (see subsection 4.2). 

For the CPS, each dimension that defines a set of controls in the current post-stratification 
will define a set of linear constraints for the GLS procedure. The function to be minimized is 


    $f(F) = (F - P)' P_0^{-1} (F - P) = \sum_i (W_{2i} - W_{1i})^2 / W_{1i},$

subject to $X'F = N$,

where  $F$ = (n × 1) vector of derived final weights ($W_{2i}$) for each of the n sample
           persons,
       $P$ = (n × 1) vector of sample person weights prior to post-stratification ($W_{1i}$),
       $P_0$ = (n × n) diagonal matrix with the $W_{1i}$ on the diagonal,
       $X$ = (n × k) design matrix whose rows correspond to sample persons, and whose
           columns correspond to control cells. The entries of the matrix ($x_{ij}$) are 0’s or
           1’s, indicating the appropriate control categories for each of the n sample
           persons,
       $N$ = (k × 1) vector of independent population estimates, corresponding to the
           columns of X. These estimates are the same as those used in the CPS RRE.

The columns of X are required to be linearly independent so that an inverse of the matrix
$(X' P_0 X)$ is achievable. In setting up matrices X and N for CPS, the 137 control cells used
in the RRE (state, age/sex/ethnicity, age/sex/race) were reduced to a set of k = 132 linearly
independent cells.

The unique solution to $X'F = N$ that minimizes $f(F)$ is, as shown in Luery (1986),

    $F = P + P_0 X (X' P_0 X)^{-1} (N - X'P).$


Although the elements of F are not constrained to be positive, in this application of GLS 
for CPS, the elements of F were all positive without the need for additional constraints. 
Methodology for providing non-negative weights in this context is discussed in Huang and 
Fuller (1978) and Zieschang (1986), among others. 
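
The closed-form GLS solution can be illustrated with a small NumPy sketch. The helper name `gls_weights` and the six-person example are assumptions made here for illustration; the original computations were done with PROC MATRIX in SAS. The sketch avoids forming the n × n diagonal matrix by scaling the rows of X by the prior weights, and it assumes the columns of X are linearly independent.

```python
import numpy as np

def gls_weights(P, X, N):
    """Return F = P + P0 X (X' P0 X)^{-1} (N - X'P), with P0 = diag(P)."""
    P = np.asarray(P, dtype=float)
    X = np.asarray(X, dtype=float)
    N = np.asarray(N, dtype=float)
    P0X = P[:, None] * X                     # P0 X without building diag(P)
    lam = np.linalg.solve(X.T @ P0X, N - X.T @ P)
    return P + P0X @ lam

# Hypothetical example: 6 sample persons, 3 linearly independent control cells
# (two state indicators and one sex indicator).
X = np.array([[1, 0, 1],
              [1, 0, 0],
              [1, 0, 1],
              [0, 1, 0],
              [0, 1, 1],
              [0, 1, 0]])
P = np.array([10.0, 12.0, 9.0, 11.0, 10.0, 8.0])
N = np.array([33.0, 30.0, 32.0])
F = gls_weights(P, X, N)
```

Nothing in this formula forces the elements of F to be positive, so a check such as `(F > 0).all()` is a sensible companion to any run.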


4. RESULTS 


4.1 Macro-Level 


a. Estimates

Labor force estimates were tabulated for several demographic groups for July 1983 and
July 1984, using the final weights derived from RRE and GLS. Standard errors for both
RRE and GLS were calculated using a random group estimator of the form (Wolter 1985)

    $\sum_{k=1}^{8} (8 Y_k - Y)^2 / 56,$

where  $Y_k$ = sum of the weights for sample records from the k-th rotation group with the
             characteristic Y,
       $Y$ = sum of the $Y_k$.
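
As a worked illustration of the random group estimator above, the following sketch computes it for a set of rotation group totals. The function name and the input values are hypothetical; with eight groups the divisor g(g - 1) equals the 56 shown in the formula.

```python
import numpy as np

def random_group_variance(Y_k):
    """Random group variance estimate from g rotation group totals Y_k."""
    Y_k = np.asarray(Y_k, dtype=float)
    g = Y_k.size                  # number of random groups (8 for the CPS)
    Y = Y_k.sum()                 # full-sample estimate
    return np.sum((g * Y_k - Y) ** 2) / (g * (g - 1))

# Hypothetical rotation group totals.
v = random_group_variance([1, 1, 1, 1, 1, 1, 1, 9])
se = np.sqrt(v)
```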


This variance estimator, while not accounting for the multi-stage design of the CPS, was
used due to the unavailability of design information on the CPS public use microdata file.
Relative differences were calculated for both estimates of level and estimates of standard
error. The relative difference was defined as:

    $(Y_{GLS} - Y_{RRE}) / Y_{RRE},$

where  $Y_{RRE}$ = estimate of Y based on the weights derived through the use of RRE,
       $Y_{GLS}$ = estimate of Y based on the weights derived through the use of GLS.


As the data in Table 1 indicate, neither weighted labor force estimates nor estimates of
standard error based on the current CPS RRE procedure and the GLS procedure showed
any noticeable differences or trends when subaggregated to the sex by race/ethnicity level.

For labor force estimates by sex by race/ethnicity the estimated absolute relative differences
between the CPS RRE and GLS estimates were all less than 0.3% (well below the estimated
CVs of each estimate). For the majority of these estimates, in particular for total and whites,
the absolute relative difference was less than 0.1%.

For many of the characteristics the sign of the relative difference changed from 1983 to
1984; thus there does not appear to be a pattern to the differences in the estimates obtained
from the two procedures.

Table 1
Labor Force Estimates by Sex/Race or Ethnicity

                                1983                                   1984
                       GLS          (GLS-RRE)/RRE             GLS          (GLS-RRE)/RRE
                    Total     S.E.    Total     S.E.       Total     S.E.    Total     S.E.
                    (000)    (000)      (%)      (%)       (000)    (000)      (%)      (%)

Total     Emp      103516      403     0.00    -0.14      107535      352    -0.01     ete,
          UE        10669      221    -0.04    -0.75        8765      118    -0.06    -0.21
          Rate      9.34%    0.19%    -0.04    -0.56       7.54%    0.09%    -0.05     0.27
          NILF      59938      373     0.01    -0.68       60080      419     0.02     0.41
White     Emp       91338      344     0.00    -0.33       94417      274     0.00     0.70
          UE         7928      236     0.00    -0.27        6282      120     0.00    -0.14
          Rate      7.99%    0.23%     0.00    -0.26       6.24%    0.10%     0.00    -0.16
          NILF      51915      340     0.00    -0.36       51700      358     0.00     0.39
Black     Emp        9871       69     0.06    -3.44       10371       98     0.02     0.17
          UE         2434       68    -0.12    -1.07        2202       60    -0.03     1.41
          Rate     19.78%    0.55%    -0.14    -1.60      17.51%    0.42%    -0.04     1.49
          NILF       6628       26    -0.04    -1.47        6765      109    -0.02     0.09
Hispanic  Emp        6132       73    -0.03    -0.59        6607      102    -0.03     1.90
          UE          920       79    -0.05    -0.29         786       70    -0.08    -0.03
          Rate     13.04%    1.10%    -0.02    -0.33      10.63%    0.96%    -0.05     0.35
          NILF       3760       31     0.05    -0.39        3786       73     0.04     1.02

Male
Total     Emp       58985      147     0.00    -1.58       61045      188     0.00     1.74
          UE         5980      134    -0.05    -0.88        4682       79    -0.02     0.77
          Rate      9.20%    0.19%    -0.05    -0.79       7.12%    0.11%    -0.02     1.30
          NILF      17495      178     0.01    -1.81       17840      214     0.02     0.64
White     Emp       52674      482     0.00     0.42       54261      111     0.00     0.34
          UE         4484      131     0.01    -0.49        3394       93     0.01    -0.12
          Rate      7.84%    0.21%     0.00    -0.47       5.89%    0.15%     0.01    -0.13
          NILF      14985      160    -0.02    -0.40       15077      150     0.00     0.16
Black     Emp        5047       56     0.07    -1.70        5263       84     0.01    -0.50
          UE         1300       45    -0.20    -1.87        1137       33     0.08       a2
          Rate     20.49%    0.71%    -0.21    -2.02      17.76%    0.51%     0.05     0.94
          NILF       2097       40    -0.04    -0.13        2236       88    -0.07    -0.48
Hispanic  Emp        3781       48     0.01    -0.86        4064       79    -0.02     1.29
          UE          534       45    -0.16    -0.83         451       41    -0.05     Ossi
          Rate     12.38%    0.99%    -0.15    -0.89       9.99%    0.95%    -0.03     0.66
          NILF        981       42     0.00    -0.42         964       57     0.07     1.40

Female
Total     Emp       44531      320    -0.01    -0.01       46490      194    -0.01     1.48
          UE         4689      107    -0.04    -0.19        4083       88    -0.10    -1.22
          Rate      9.53%    0.23%    -0.03    -0.02       8.07%    0.16%    -0.09    -0.80
          NILF      42443      287     0.01    -0.26       42240      27,     0.02     0.34
White     Emp       38664      315     0.00    -0.29       40156      191     0.00     0.66
          UE         3444      115    -0.01     0.16        2888       68     0.00    -0.32
          Rate      8.18%    0.28%    -0.01     0.11       6.71%    0.15%     0.00    -0.34
          NILF      36929      283     0.01    -0.32       36623      214     0.00     0.53
Black     Emp        4824       57     0.05     0.56        5108       50     0.02     1.69
          UE         1134       46    -0.02     0.07        1065       46    -0.14    -0.62
          Rate     19.03%    0.80%    -0.06     0.08      17.25%    0.67%    -0.13    -0.63
          NILF       4531       24    -0.04     2.99        4529       59     0.01     1.49
Hispanic  Emp        2350       44    -0.08    -0.46        2543       38    -0.05     3.04
          UE          385       41     0.10     0.51         335       34    -0.13    -0.62
          Rate     14.08%    1.46%     0.16     0.57      11.64%    1.18%    -0.07    -0.11
          NILF       2778       33     0.07    -0.87        2822       27     0.03     0.13

The absolute relative differences between the CPS RRE and GLS estimates of standard
errors for national labor force estimates were all less than: 1.9% for total population; 0.7%
for whites; 3.5% for blacks; and 3.1% for Hispanics.


b. Month-in-Sample Indexes 

It is a well-documented fact that the estimates produced from the CPS final weights have
certain patterns of relative bias based upon the time the rotation group has been in sample
(Bailar 1975). Month-in-sample indexes

    $I_k = (8 Y_k / Y) \times 100,$

were calculated for both July 1983 and July 1984 based upon both the RRE estimates and
the GLS estimates.

Month-in-sample indexes for labor force by race, labor force by sex, and labor force by
ethnicity were virtually identical for estimates based upon the CPS RRE and GLS procedures.


4.2 Micro-Level 


a. Adjustments to Sample Weights 

Both RRE and GLS minimize some measure of closeness between the pre- and post-
adjustment sample weights. For RRE the measure is (Ireland and Kullback 1968)

    $M_A = \sum_i W_{2i} \ln (W_{2i} / W_{1i}).$

For GLS, the measure is (Luery 1986)

    $M_B = \sum_i (W_{2i} - W_{1i})^2 / W_{1i},$

where  $W_{1i}$ = weight for sample record i prior to adjustment,
       $W_{2i}$ = weight for sample record i following adjustment.
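
A minimal sketch of how the two measures of closeness might be tabulated for a set of records is given below. The function name `closeness` and the example weights are hypothetical, and all weights are assumed to be strictly positive so that the logarithm is defined.

```python
import numpy as np

def closeness(w1, w2):
    """Return (M_A, M_B) for prior weights w1 and adjusted weights w2."""
    w1 = np.asarray(w1, dtype=float)
    w2 = np.asarray(w2, dtype=float)
    m_a = np.sum(w2 * np.log(w2 / w1))       # measure associated with RRE
    m_b = np.sum((w2 - w1) ** 2 / w1)        # measure associated with GLS
    return m_a, m_b

# Hypothetical example: every weight is doubled by the adjustment.
m_a, m_b = closeness([1.0, 1.0], [2.0, 2.0])
```

Restricting `w1` and `w2` to the records of a subgroup (for example, one race or sex category within a rotation group) gives the subaggregate measures of the kind summarized in Table 2.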


Tabulation of the measures of closeness (summarized in Table 2) provided some interesting
and, in some cases, puzzling results. The CPS RRE yielded smaller values for both measures.
The GLS procedure did tend to produce smaller values for the measures for certain subgroups,
most notably for blacks and Hispanics. It should be noted that the differences between the
values for the measures for RRE and GLS were almost always less than 1%.

Although $M_B$ should be minimized through the use of the GLS procedure, the value of
$M_B$ based upon the GLS weights for the total sample was greater than the value of $M_B$ for
the CPS RRE weights for 11 of the 16 rotation groups.

In seeking a reason for this apparent contradiction, it was noted that the CPS RRE had
yet to converge to the age/sex/ethnicity controls after six iterations. The extent of this
non-convergence is very small: less than 1.0% for all control categories. However, given the
difference in $M_B$ between the RRE and GLS, a change in the RRE sample weights of only
0.1%-0.2% could reverse the results. Rerunning RRE using 15 iterations, although still not
achieving convergence, did provide indications that the slight lack of convergence of the RRE
is the reason for the results for $M_B$. (It should be noted that the GLS procedure minimizes
$M_B$ among the class of adjustment procedures yielding estimates that meet the population
controls. Since the CPS RRE did not converge to the population controls, it is not a member
of this class.)

Table 2
Comparison of measures of closeness
based on 8 RGs for each year
(# of RGs with RRE < GLS)

                   $M_A$              $M_B$
               1983    1984       1983    1984

Total            8       8          4       7
White            4       7          3       4
Black            3       3          1       1
Hispanic         0       0          0       0
Male             2       fk         1       5
Female           8       8          8       8


Although an adjustment procedure such as RRE or GLS may minimize some measure 
of closeness for the total sample, it does not necessarily minimize that measure of closeness 
for subaggregates of the sample which were controlled for (e.g., blacks, Hispanics, males). 
Given the use of controls, and the fact that the overall measure of closeness is being minimized, 
it would seem desirable to have an adjustment procedure produce small measures of closeness 
at the subaggregate level also. The GLS procedure yielded smaller measures in almost every 
rotation group for Hispanics, in many rotation groups for blacks, and in several rotation 
groups for whites and males. 


b. Comparison of Adjustments 

Both RRE and GLS determine adjustment factors within cells defined by the intersection 
of the marginal constraints. Each sample record within a cell receives the same factor. To 
compare the adjustments made by the two procedures, the factors determined for each sam- 
ple record by each procedure were compared using the following ratio 


    $RRE/GLS = [(W_{2i} / W_{1i})_{RRE}] / [(W_{2i} / W_{1i})_{GLS}].$


This ratio indicates the relationship between the adjustments made to a sample person 
weight by the RRE and GLS procedures. For comparison purposes, values of RRE/GLS 
less than 0.95 or greater than 1.05 were used to denote differences in the adjustments made 
by RRE and GLS. 

For each set of independent population controls, ratios E/C (i.e., coverage rates), where
E is the sample estimate based on the sample person weights prior to post-stratification and
C is the independent control, were derived.

Within each set of controls (state, age/sex/ethnicity, age/sex/race) sample records were
categorized by their coverage rates. Table 3 provides the sample distribution by coverage
rate categories and by the RRE/GLS values, as well as the proportion of records within each
coverage rate category that have the RRE/GLS values.

The data in Table 3 indicate that, for each set of controls, sample records from population
groups which were over- or under-covered to some extent by the survey (i.e., for which
the coverage rate is not near 1) were more likely to be adjusted differently by RRE and GLS
than were sample records in population groups adequately covered by the survey.

Table 3
Comparison of RRE and GLS adjustments, 1984

                                          Proportion          Proportion
              Coverage     Proportion     of Sample           of Category
Control       Rate         of Total       with RRE/GLS        with RRE/GLS
Marginal      Category     Sample         <0.95 or >1.05      <0.95 or >1.05

Age/Sex/      <0.7         0.007          0.057               0.219
Race          0.7-0.8      0.022          0.116               0.136
              0.8-0.9      0.241          0.147               0.019
              0.9-1.1      0.699          0.504               0.019
              1.1-1.2      0.021          0.069               0.084
              >1.2         0.010          0.106               0.275

Age/Sex/      <0.7         0.010          0.078               0.198
Ethnicity     0.7-0.8      0.014          0.032               0.058
              0.8-0.9      0.106          0.135               0.033
              0.9-1.1      0.869          0.741               0.022
              1.1-1.2      0.001          0.007               0.202
              >1.2         0.001          0.007               0.373

State         <0.7         0.056          0.068               0.031
              0.7-0.8      0.111          0.180               0.042
              0.8-0.9      0.278          0.325               0.030
              0.9-1.1      0.479          0.342               0.018
              1.1-1.2      0.026          0.009               0.009
              >1.2         0.049          0.077               0.040


4.3 Computer Resources 


The CPS RRE and GLS procedures were run on an IBM System 370 at the National
Institutes of Health using PROC MATRIX in the SAS System. The CPU time to prepare the
files and perform the weighting was approximately three times as much for the GLS
procedure as it was for the RRE procedure. There was also more storage of files involved
with the GLS procedure. (The matrices involved for CPS are quite large, with
the number of rows for $P$, $P_0$, $X$, and $N$ being around 14,000 for each rotation group.)


5. SUMMARY AND CONCLUSIONS 


This investigation was intended to provide a comparison of RRE and GLS as applied to
the CPS, at both the macro and micro level.

The results obtained at the macro level do not indicate any difference in the estimates
obtained from the RRE and GLS procedures.

The measures of closeness indicated that the CPS RRE made slightly smaller changes overall
to the sample weights to meet the control constraints than did the GLS. The CPS RRE
tended to produce slightly larger measures of closeness for subaggregates of minority
populations. The two procedures differ most notably in the adjustments made to portions of the
population which are either over- or under-covered.

Based on the work done in this investigation, it does appear that the RRE takes less
computer time to run for the CPS second-stage adjustment than the GLS.

A Class of Methods for Using
Person Controls in Household Weighting 


CHARLES H. ALEXANDER¹


ABSTRACT 


A class of ‘‘constrained minimum distance’’ methods is considered for constraining household weights
to be consistent with auxiliary information on the number of persons in various age × race × sex
cells. The constrained weights are as close as possible to the initial weights based on the inverse
probability of selection. This class of methods includes raking and generalized least squares methods, as
well as multinomial maximum likelihood (where the cells of the distribution are household types).
The properties of the methods in the presence of systematic undercoverage of the household types are
studied through some simple models for coverage. Comparisons with the principal person method are
made and the paper concludes with the observation that it is necessary to know more about the nature
of survey undercoverage before deciding on which of the constrained minimum distance or principal
person methods is to be preferred in applications.


KEY WORDS: Weighting; Auxiliary information; Raking ratio estimation; Principal person method; 
Survey coverage. 


1. INTRODUCTION 


Post-stratification is commonly used to adjust survey weights to take into account
independent information about the number of units of certain kinds in the population. For
example, independent estimates of the population in various age × race × sex post-stratification
cells may be available from adjusting census counts for known changes in the number of
persons since the census. These independent estimates are often referred to as ‘‘control
counts’’. Prior to post-stratification, each sample person (or household) has an initial weight,
typically corresponding to the inverse of the selection probability. A post-stratification ratio
adjustment factor is applied to the weights of all sample persons in each cell, so that the
sum of the adjusted person weights equals the independent control count for the cell. This
adjustment is especially important when there is systematic undercoverage of households or
persons within households.

For most U.S. Census Bureau demographic surveys, post-stratification is used in
assigning weights to sample persons, but is not used directly in assigning weights to sample
households. This is due to the greater difficulty of obtaining independent estimates for
households. Instead, household weights for these surveys are assigned using some version
of the ‘‘principal person’’ method. In the basic principal person method, the household weight
is set equal to the final post-stratified person weight of the ‘‘principal’’ person in the
household. The rule for identifying this person will be described in Section 2. By using the
post-stratified person weight, the principal person method does incorporate the independent
estimates of persons into the weights assigned to households.

The most obvious problem with the principal person method is that when the resulting
household weights are used to calculate weighted estimates of the number of persons in each
post-stratification cell, with each person being given his or her household’s weight, these
estimates do not agree with the control counts used in the post-stratification. Consequently,
there has been interest in methods of assigning weights to households which are constrained
to produce person estimates which agree with the independent control counts.

This paper considers a class of methods for assigning survey weights to households,
constrained to be consistent with the ‘‘known’’ control counts in various person cells. The general
idea is to find household weights which satisfy the constraints and are as close as possible 
to the initial vector of weights assigned to the households. The different methods within the 
class correspond to different ways of measuring the distance between the initial vector of 
weights and the adjusted vector of weights. 

Section 2 describes six ‘‘constrained minimum distance’’ weighting methods of this type 
plus a version of the principal person method. Three of the six methods have been investigated 
previously, and the others are added in this paper to round out the picture. Section 3 describes 
the computation of the weights. Section 4 discusses how the adjusted weight depends on the 
composition of the household. Section 5 discusses results and examples which may help in 
understanding what these methods do. Section 6 describes areas for further research. 

This work has numerous antecedents. The general class of constrained minimum distance
methods is suggested for household weighting by Luery (1986). Extending Luery’s work,
Zieschang (1986a) proposes using one of these methods, generalized least squares, for
weighting the U.S. Consumer Expenditure Surveys. Another member of the class is the
‘‘minimum discriminant information method’’, otherwise known as raking ratio estimation
or, simply, raking. Oh and Scheuren (1978a) specifically discuss the raking approach to the
household weighting problem, and give additional references to a rich literature on raking
and related methods. The idea of viewing raking as a constrained minimum distance
problem dates back at least to Deming and Stephan (1940). The fundamental principles of this
approach are explored in Ireland and Kullback (1968). Applications to survey weight
adjustment are well covered in Brackstone and Rao (1979). The class of methods also includes
two criterion functions related to multinomial maximum likelihood. The relationship of this
to raking has been extensively studied; see, for example, Bishop, Fienberg, and Holland (1976).
Fienberg (1986) points out that the distance criteria considered in this paper may be viewed
as special cases of a parametric family of functions considered in Cressie and Read (1984).


2. CONSTRAINED MINIMUM DISTANCE METHODS 


2.1 Methods Based on Household Weights 


Consider a sample of K households, whose initial weights are given by the vector
$S = (S_1, \ldots, S_K)'$. In this paper, $S_k$ will be the inverse of the probability of selection of
the k-th household; in some applications other adjustments such as nonresponse factors may
be included in the initial weight.

Suppose that there are J post-stratification cells, and that the number of persons in the
population ($N_j$) is known for each cell. For example, for the U.S. Consumer Expenditure
Survey, there are J = 48 cells corresponding to combinations of the two sexes, two races
(black, nonblack), and twelve age categories. In that survey, persons younger than 1 
are not included. The control counts for these cells will be treated as a vector
$N = (N_1, \ldots, N_J)'$.

The composition of the sample households will be described by a matrix $A = (a_{kj})$,
where $a_{kj}$ is equal to the number of persons in the k-th sample household who are in the
j-th post-stratification cell. Summing over the post-stratification cells for the k-th household
gives $a_{k.}$, the total number of persons in the k-th household. For household k, the vector
$(a_{k1}, \ldots, a_{kJ})$ describes the composition of the household. For example, if the vector is
(2,1,0,0,...,0), then the household contains exactly two persons in the first cell and one
in the second.

Using the initial weights S, the weighted sample estimate of the number of persons in cell
j would be $\hat{N}_j = \sum_k a_{kj} S_k$, or in general $\hat{N} = A'S$.

Typically $\hat{N} \neq N$, i.e., the initial weighted estimate of persons in the post-stratification
cells may not equal the known population of the cell.

The goal is to define a new vector of weights $W = (W_1, \ldots, W_K)'$ for the sample
households, so that $N = A'W$ or

    $\sum_k a_{kj} W_k = N_j$   for $j = 1, \ldots, J$.                    (1)
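
To make the notation concrete, the following sketch builds a small composition matrix and checks constraint (1). The three households, three cells, weights, and control counts are invented purely for illustration, and the helper `satisfies_controls` is a hypothetical name.

```python
import numpy as np

# Rows are households k, columns are post-stratification cells j.
# Row (2, 1, 0): two persons in the first cell and one in the second.
A = np.array([[2, 1, 0],
              [0, 1, 1],
              [1, 0, 0]])
S = np.array([100.0, 150.0, 120.0])   # initial household weights
N = np.array([350.0, 260.0, 140.0])   # hypothetical control counts

persons_per_household = A.sum(axis=1)  # a_k. for each household
N_hat = A.T @ S                        # initial weighted person estimates

def satisfies_controls(W, A, N, tol=1e-8):
    """Check constraint (1): A'W = N within a tolerance."""
    return np.allclose(A.T @ np.asarray(W, dtype=float), N, atol=tol)

W = np.array([120.0, 140.0, 110.0])    # one weight vector meeting the controls
```

In this toy case A is square and nonsingular, so W is the only solution of (1); with many more households than cells there are typically many solutions, which is why a distance criterion is needed to pick one.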


The solution to (1) is not necessarily unique. The idea of the constrained minimum distance
methods is to choose W so as to minimize some measure D(W,S) of the distance between
the vectors W and S, subject to (1). In this way, the initial weights S are changed as little
as possible in meeting the constraint that the adjusted weights should agree with the known
control totals. Note that, for certain possible values $(N_1, \ldots, N_J)$, it may be impossible for
any vector of weights W to satisfy the constraints (1). Practically speaking, this possible
infeasibility does not seem to be a problem, provided the sample is large enough to include
a good representation of different types of households, since the controls N are generated 
from the actual population and therefore can be expected to be ‘‘feasible’’. 

There are numerous ways of measuring the difference between two vectors. Three distance 
criteria D(W,S) will be considered, corresponding to a household-level generalized least 
squares (GLS-H) objective function, a minimum discriminant information (MDI-H) func- 
tion, and a maximum likelihood estimation (MLE-H) criterion. The criteria are: 


    GLS-H:   $\sum_k (W_k - S_k)^2 / S_k,$                                   (2a)

    MDI-H:   $(S_. - W_.) + \sum_k W_k \ln(W_k / S_k),$                      (2b)

    MLE-H:   $(W_. - S_.) - \sum_k S_k \ln(W_k / S_k).$                      (2c)


Throughout the paper, the dot notation is used to denote summation over a subscript. 
In each case D(W,S) is nonnegative and is equal to zero if and only if W = S. This can 
be shown, in the usual way, by examining the first and second partial derivatives of each 
expression with respect to the W,. 
Algorithms for calculating W to minimize these three criteria, while meeting the constraint 
(1) to the degree of approximation desired, will be discussed in Section 3. 
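
The three household-level criteria (2a) to (2c) can be evaluated directly, which makes it easy to check numerically that each is zero at W = S and positive elsewhere. The function names below are hypothetical, and the sketch assumes strictly positive weights.

```python
import numpy as np

def gls_h(W, S):
    """Criterion (2a): generalized least squares distance."""
    return np.sum((W - S) ** 2 / S)

def mdi_h(W, S):
    """Criterion (2b): minimum discriminant information distance."""
    return (S.sum() - W.sum()) + np.sum(W * np.log(W / S))

def mle_h(W, S):
    """Criterion (2c): maximum likelihood distance."""
    return (W.sum() - S.sum()) - np.sum(S * np.log(W / S))

# Hypothetical two-household example.
S = np.array([1.0, 1.0])
W = np.array([2.0, 0.5])
d_gls, d_mdi, d_mle = gls_h(W, S), mdi_h(W, S), mle_h(W, S)
```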


2.2 Methods Derived from Person Weights 


An alternative approach to this problem leads to a slight but important modification of 
the three distance criteria. These modified criteria are given by (5a), (5b), and (5c) below.
Although these criteria lead to weights for households, they are generated by an approach 
which starts out by trying to define weights for persons. Accordingly, first consider the pro- 
blem as one of defining person weights as close as possible to their original household weights, 
subject to the constraint that the weighted estimate of persons in each post-stratification cell 
equals the known control. Let the persons in the k-th household be numbered $i = 1, \ldots, a_{k\cdot}$,
and let $S_{ki}$ be the initial weight of the i-th person in the k-th household; note that $S_{ki} = S_k$.

Let $b_{kij}$ be a zero-one indicator variable showing whether the i-th person in the k-th
household is in the j-th post-stratification cell. Then the condition for consistency with the
controls is


$$\sum_k \sum_i b_{kij} W_{ki} = N_j. \tag{3}$$


The three criteria for the person weighting problem would be 


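$$\sum_k \sum_i (W_{ki} - S_{ki})^2 / S_{ki}, \tag{4a}$$

$$\sum_k \sum_i (S_{ki} - W_{ki}) + \sum_k \sum_i W_{ki} \ln(W_{ki} / S_{ki}), \tag{4b}$$

$$\sum_k \sum_i (W_{ki} - S_{ki}) - \sum_k \sum_i S_{ki} \ln(W_{ki} / S_{ki}). \tag{4c}$$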


These criteria could be used for defining person weights. In fact the criterion (4c) would
lead to the post-stratification weights which are used in person weighting for the Consumer
Expenditure Survey, as described in Alexander (1986). However, our problem is to define
weights for households. Household weights may be obtained from these criterion functions
by imposing upon the person problem the additional constraint that all persons in the same
household must have the same weight. Therefore, let $W_{ki} = W_k$ for $i = 1, \ldots, a_{k\cdot}$. Under
this constraint, (3) becomes


$$N_j = \sum_k \Big( \sum_i b_{kij} \Big) W_k = \sum_k a_{kj} W_k,$$


which is the same as the constraint (1) in Section 2.1. The distance criteria (4a), (4b), and 
(4c) now become: 


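$$\text{GLS-P:} \quad \sum_k a_{k\cdot} (W_k - S_k)^2 / S_k, \tag{5a}$$

$$\text{MDI-P:} \quad \sum_k a_{k\cdot} S_k - \sum_k a_{k\cdot} W_k + \sum_k a_{k\cdot} W_k \ln(W_k / S_k), \tag{5b}$$

$$\text{MLE-P:} \quad \sum_k a_{k\cdot} W_k - \sum_k a_{k\cdot} S_k - \sum_k a_{k\cdot} S_k \ln(W_k / S_k). \tag{5c}$$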


The criteria are now summations at the household level, but the household size $a_{k\cdot}$ has been
brought into the criterion for measuring the distance between the initial and adjusted vector
of weights. These criteria will be seen to have advantages over the more direct approach which
led to (2a), (2b), and (2c).
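
The step from the person-level criterion to the household-level one can be checked numerically. The sketch below, which uses hypothetical weights and household sizes, sums the least squares criterion (4a) person by person under the constraint that every person carries the household's weight, and compares the result with the household-level form (5a). The function names are illustrative only.

```python
def gls_person_level(W, S, sizes):
    """Criterion (4a) when each person in household k has weights W_k and S_k."""
    total = 0.0
    for w, s, a in zip(W, S, sizes):
        for _ in range(a):  # one identical term per person i = 1, ..., a_k.
            total += (w - s) ** 2 / s
    return total

def gls_p(W, S, sizes):
    """Criterion (5a): sum over k of a_k. (W_k - S_k)^2 / S_k."""
    return sum(a * (w - s) ** 2 / s for w, s, a in zip(W, S, sizes))

# Hypothetical data: three households of sizes 1, 3 and 2.
S = [100.0, 250.0, 80.0]
W = [110.0, 240.0, 95.0]
sizes = [1, 3, 2]

person_value = gls_person_level(W, S, sizes)
household_value = gls_p(W, S, sizes)
```

The two values coincide, because each household contributes the same term once for each of its members.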


2.3 The Principal Person Method


In the basic principal person method, the post-stratified person weight of the household’s
"principal person" is used as the household’s weight. To determine the principal person,
it is first necessary to determine the household’s "reference person". The reference person
is identified by the interviewer as the first person mentioned in response to the instruction
"start by giving me the name of someone who owns or rents this house." Household rela-
tionships are defined in terms of the other members’ relationship to this reference person.
"Reference person" has replaced the "head of household" concept for this purpose.

The principal person is the wife of the reference person if the reference person is a mar- 
ried male with spouse present. Otherwise, the principal person is the reference person himself 
or herself. The rationale for this choice is that the principal person should be a person who 
is not likely to be missed due to within-household undercoverage. In general, women have 
better coverage than men. Further, the principal owners or renters of the house or apart- 
ment seem unlikely to be overlooked. 

The basic idea of the principal person method is that there is exactly one principal person 
in each household. Consequently, the number of households may be estimated by estimating 
the number of principal persons. This basic method is used for the U.S. National Crime
Survey. Other surveys, such as the U.S. Consumer Expenditure Surveys or Current Popula-
tion Survey, make additional adjustments based on assumptions about within-household
undercoverage of principal persons, as compared to other persons in the same
post-stratification cell (Alexander 1986).

The principal person method is difficult to model theoretically because the designation 
of the reference person is somewhat arbitrary. In the hypothetical examples of Section 5, 
a simplified version of the principal person method will be used, in which the principal per- 
son is the household member whose post-stratification cell has the best coverage, i.e., whose 
post-stratification factor is closest to one. A similar idea is used in Scheuren (1981). 

This simplified principal person method will be represented symbolically as follows. For 
the k-th sample household, let j(k) be the post-stratification cell of the household’s prin- 
cipal person. Then the household’s principal person weight is 


$$W_k = S_k \left( N_{j(k)} / \hat{N}_{j(k)} \right).$$
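
A minimal sketch of the simplified principal person weight follows. It assumes that the post-stratification factor of cell j is the control total divided by the initial weighted person count, and that ties between cells are broken by taking the lowest cell index; the tie rule and the function name are assumptions made for illustration. The data are the household-type totals of Table 1, with cell 0 for females and cell 1 for males.

```python
def principal_person_weights(S, A, N):
    """S[k]: initial weight; A[k][j]: persons of unit k in cell j; N[j]: control total."""
    J = len(N)
    n_hat = [sum(S[k] * A[k][j] for k in range(len(S))) for j in range(J)]
    factor = [N[j] / n_hat[j] for j in range(J)]
    W = []
    for s, row in zip(S, A):
        occupied = [j for j in range(J) if row[j] > 0]
        # Principal person: the member whose cell factor is closest to one.
        jk = min(occupied, key=lambda j: abs(factor[j] - 1.0))
        W.append(s * factor[jk])
    return W

# Table 1 initial weights by type: F, M, FF, FM, MM, FFM, FMM.
S = [22500.0, 13500.0, 6300.0, 36000.0, 4500.0, 10800.0, 10800.0]
A = [[1, 0], [0, 1], [2, 0], [1, 1], [0, 2], [2, 1], [1, 2]]
N = [115000.0, 101000.0]

W_pp = principal_person_weights(S, A, N)
```

With these data both cells have the same factor, 1/0.9, so every type total is inflated back to its population value.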


3. COMPUTATION OF THE WEIGHTS 


The two least squares methods, GLS-H and GLS-P, have closed-form expressions for W,
provided that there exists some solution to the constraints (1). For the GLS-H weights, the
adjusted weights are given by


$$W = S + MA(A'MA)^{-1}(N - A'S), \tag{6}$$


where $S = (S_1, \ldots, S_K)'$, $N = (N_1, \ldots, N_J)'$, A is the $K \times J$ matrix $(a_{kj})$, and M is the $K \times K$
diagonal matrix with the elements of S on the main diagonal. The weights W for the GLS-P
method are also given by (6), except that M is the $K \times K$ diagonal matrix with the values
$S_1/a_{1\cdot}, \ldots, S_K/a_{K\cdot}$ on the main diagonal.
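
The closed form (6) can be sketched in a few lines of plain Python. The helper `solve` is a small Gauss-Jordan routine written for this illustration, and `gls_weights` is a hypothetical name; neither comes from the paper. The example applies both variants to the household-type totals of Table 1 (cell 0 for females, cell 1 for males).

```python
def solve(M, b):
    """Solve the small system M x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    aug = [list(M[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(n):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def gls_weights(S, A, N, person=False):
    """Equation (6): W = S + M A (A'MA)^{-1} (N - A'S)."""
    K, J = len(S), len(N)
    # Diagonal of M: S_k for GLS-H, S_k / a_k. for GLS-P.
    m = [S[k] / sum(A[k]) if person else S[k] for k in range(K)]
    AMA = [[sum(A[k][j] * m[k] * A[k][l] for k in range(K)) for l in range(J)]
           for j in range(J)]
    resid = [N[j] - sum(A[k][j] * S[k] for k in range(K)) for j in range(J)]
    lam = solve(AMA, resid)
    return [S[k] + m[k] * sum(A[k][j] * lam[j] for j in range(J)) for k in range(K)]

# Table 1 initial weights by type: F, M, FF, FM, MM, FFM, FMM.
S = [22500.0, 13500.0, 6300.0, 36000.0, 4500.0, 10800.0, 10800.0]
A = [[1, 0], [0, 1], [2, 0], [1, 1], [0, 2], [2, 1], [1, 2]]
N = [115000.0, 101000.0]

W_gls_h = gls_weights(S, A, N)
W_gls_p = gls_weights(S, A, N, person=True)
```

For these data the GLS-H result reproduces the GLS-H column of Table 1 after rounding, and the GLS-P result equals the initial weights divided by 0.9. Nothing in the routine prevents negative weights.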

A disadvantage of (6) for either method GLS-H or GLS-P is that the solution W may 
include negative weights. Conceptually this is unsettling, and for practical users negative 
weights are unacceptable. It is usually possible to incorporate additional constraints that the 
weights must be positive. Ways of doing this are given by Zieschang (1986a) and Huang and
Fuller (1978). However, the advantage of a simple closed-form solution is lost with these 
additional constraints. 

The raking method (MDI-P) has been used before for household weighting, e.g., by Oh 
and Scheuren (1978a). A related method which has been extensively tested is described in 
Pugh, Tyler, and George (1976), based on the approach of Stephan (1942). Luery (1986) 
gives an iterative algorithm based on Darroch and Ratcliff (1972), which is proved to con- 
verge whenever there is a solution to (1). This method is presented here, since the iterative 
step has a simple interpretation. The iteration starts with ‘‘step 0’’ weights 


$$W_k(0) = S_k \, (N_\cdot / \hat{N}_\cdot).$$


In other words, the initial weight $S_k$ is adjusted by an overall inflation factor equal to the
known population $N_\cdot$ divided by the initial weighted total population $\hat{N}_\cdot$. At subsequent iterative
steps, the adjustment is


$$W_k(i) = W_k(i-1) \prod_j \left( \frac{N_j}{\hat{N}_j(i-1)} \right)^{a_{kj}/a_{k\cdot}},$$

where $\hat{N}_j(i-1) = \sum_k a_{kj} W_k(i-1)$.


Note that $W_k(i-1)$ is multiplied by the geometric mean of the post-stratification factors
for the persons in the k-th household, where the post-stratification factors are calculated
using the weights after iteration $i - 1$.
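
The iteration can be sketched as follows, assuming the step described above: an overall inflation at step 0, then repeated multiplication by the geometric mean of the person-level post-stratification factors. The function name and the fixed iteration count are choices made for this illustration; a production routine would stop on a convergence tolerance instead.

```python
import math

def rake_mdi_p(S, A, N, iterations=200):
    """Raking (MDI-P) weights; S[k] initial weight, A[k][j] persons in cell j."""
    K, J = len(S), len(N)
    size = [sum(row) for row in A]  # household sizes a_k.

    def cell_totals(W):
        return [sum(A[k][j] * W[k] for k in range(K)) for j in range(J)]

    # Step 0: overall inflation factor N. / N-hat. .
    W = [s * sum(N) / sum(cell_totals(S)) for s in S]
    for _ in range(iterations):
        n_hat = cell_totals(W)
        factor = [N[j] / n_hat[j] for j in range(J)]
        # Geometric mean of the factors of the persons in household k.
        W = [W[k] * math.prod(factor[j] ** (A[k][j] / size[k]) for j in range(J))
             for k in range(K)]
    return W

# Table 1 initial weights by type: F, M, FF, FM, MM, FFM, FMM.
S = [22500.0, 13500.0, 6300.0, 36000.0, 4500.0, 10800.0, 10800.0]
A = [[1, 0], [0, 1], [2, 0], [1, 1], [0, 2], [2, 1], [1, 2]]
N = [115000.0, 101000.0]

W_rake = rake_mdi_p(S, A, N)
```

With the Table 1 data the step 0 weights already satisfy both controls, so the later iterations leave them unchanged at the initial weights divided by 0.9.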

The other three methods, MDI-H, MLE-H, and MLE-P, have not been extensively studied.
The following iterative algorithms have worked successfully in small hypothetical examples
such as those given in Section 5. In each case, a system of equations, which the weights must
satisfy in order to minimize the distance criterion subject to the constraints, can be found
by the use of Lagrange multipliers. The equations cannot be solved directly, but if an iterative
method produces solutions of the proper form, then the solution minimizes the criterion.
If the algorithms converge, the solutions will satisfy the equations. However, the author has
no general proof of convergence. A possible alternative approach for the "maximum
likelihood" criteria would be to apply the approach of Haber and Brown (1986). Other related
work is Fagan and Greenberg (1985).


3.1 Method for MDI-H 


The equation for the weights is


$$W_k = S_k \prod_j \gamma_j^{a_{kj}} \tag{7}$$

subject to (1). If values $\gamma_1, \ldots, \gamma_J$ can be found so that the weights calculated according
to (7) satisfy (1), then those weights minimize (2b) subject to (1). An iterative algorithm for
generating such a vector W is as follows.
Initialize $W_k(0) = S_k$ and $\gamma_j(0) = 1$. Then at the i-th iteration let


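$$\gamma_j(i) = \gamma_j(i-1) \left[ 1 - \frac{\hat{N}_j(i-1) - N_j}{\sum_k a_{kj}^2 W_k(i-1)} \right],$$


where $\hat{N}_j(i-1) = \sum_k a_{kj} W_k(i-1)$. Then let $W_k(i) = S_k \prod_j \gamma_j(i)^{a_{kj}}$.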


3.2 Method for MLE-H


The solution is of the form


$$W_k = S_k \Big/ \Big( 1 + \sum_j \gamma_j a_{kj} \Big),$$

subject to (1).
An iterative solution is


$$W_k(0) = S_k \quad \text{and} \quad \gamma_j(0) = 0,$$


$$\gamma_j(i) = \gamma_j(i-1) + \big( \hat{N}_j(i-1) - N_j \big) \Big/ \Big( \sum_k \big( a_{kj} W_k(i-1) \big)^2 / S_k \Big),$$


$$W_k(i) = S_k \Big/ \Big( 1 + \sum_j \gamma_j(i) \, a_{kj} \Big).$$
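
The MLE-H iteration can be sketched as below, reading the update as: add to each multiplier the excess of the current weighted count over its control, divided by the sum over k of (a_kj W_k)^2 / S_k. The function name and the fixed number of iterations are assumptions for this illustration, and, as noted above, convergence is not guaranteed in general. The data are the household-type totals of Table 1.

```python
def mle_h_weights(S, A, N, iterations=200):
    """Iterative MLE-H weights of the form W_k = S_k / (1 + sum_j gamma_j a_kj)."""
    K, J = len(S), len(N)
    gamma = [0.0] * J
    W = list(S)
    for _ in range(iterations):
        # Weighted person counts and step sizes use the previous iteration's weights.
        n_hat = [sum(A[k][j] * W[k] for k in range(K)) for j in range(J)]
        step = [sum((A[k][j] * W[k]) ** 2 / S[k] for k in range(K)) for j in range(J)]
        gamma = [gamma[j] + (n_hat[j] - N[j]) / step[j] for j in range(J)]
        W = [S[k] / (1.0 + sum(gamma[j] * A[k][j] for j in range(J)))
             for k in range(K)]
    return W

# Table 1 initial weights by type: F, M, FF, FM, MM, FFM, FMM.
S = [22500.0, 13500.0, 6300.0, 36000.0, 4500.0, 10800.0, 10800.0]
A = [[1, 0], [0, 1], [2, 0], [1, 1], [0, 2], [2, 1], [1, 2]]
N = [115000.0, 101000.0]

W_mle_h = mle_h_weights(S, A, N)
```

For these data the iterates oscillate around the solution with shrinking amplitude, and the result agrees with the MLE-H column of Table 1 to within rounding.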
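

3.3 Method for MLE-P


The solution is of the form


$$W_k = S_k \Big/ \Big( \sum_j \gamma_j a_{kj} / a_{k\cdot} \Big),$$

subject to (1).
An iterative solution is


$$W_k(0) = S_k \quad \text{and} \quad \gamma_j(0) = 1,$$

[the update formula for $\gamma_j(i)$ is illegible in the source]


$$W_k(i) = S_k \Big/ \Big( \sum_j \gamma_j(i) \, a_{kj} / a_{k\cdot} \Big).$$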


4. THE ROLE OF A HOUSEHOLD’S "COMPOSITION TYPE"


For the six constrained minimum distance methods, the ratio of a household’s initial weight
to its adjusted weight depends on the number of people in the household in the different
post-stratification cells. To discuss this further, the notion of a household’s "composition
type" will be introduced. Two sample households, say k and m, will be said to "have the
same type" if they have exactly the same number of people in each of the post-stratification
cells, i.e., if


$$a_{kj} = a_{mj} \quad \text{for } j = 1, \ldots, J. \tag{8}$$


As an example, one household type would be a "household consisting of a white male 35-39
and a white female 30-34." Note that the composition type does not depend on family rela-
tionships.

The ratio of the adjusted weight to the initial weight, $W_k/S_k$, is the same for all house-
holds with the same type. In other words, if k and m satisfy (8), then $W_k/S_k = W_m/S_m$.
This fact was used in Ireland and Scheuren (1975). A formal proof is given in Alexander
and Roebuck (1986).

A useful consequence of this fact is that, in calculating the weights for the constrained
minimum distance methods, the calculations may be done using the household type as the
unit of analysis rather than the individual household. A simple example may make the im-
plications of these results clearer. Suppose that there are two post-stratification cells, j = 1
for females and j = 2 for males. The sample consists of K households. For household k,
the vector $(a_{k1}, a_{k2})$ describes how many females and males are in the household; a
household with vector (2,1) has two females and one male.

Practically speaking, there is some upper limit on the size of a household, and there are
only finitely many household types. For the example, assume that no household has more
than three people. Then there are T = 9 household types corresponding to the vectors: (1,0),
(0,1), (2,0), (1,1), (0,2), (2,1), (1,2), (3,0), (0,3). These types will be numbered consecutively
t = 1,...,9. The types will also be labelled mnemonically, F, M, FF, FM, MM, FFM, FMM,
FFF, MMM. Hypothetical sample data and control totals are given in Table 1. Note that
$S_t$ is the total initial weight given to households of type t.

The constrained minimum distance adjustments effectively may be calculated from the
total weights for the household composition types, $S_1, \ldots, S_9$, without actually looking at
the individual household weights. Adjusted weights $W_1, \ldots, W_9$ may be calculated using the
algorithms from Section 3, replacing summation over k by summation over t. Then for any
type t household, the adjusted weight given by the method is $W_t/S_t$ times the initial weight
for the household. (The potentially confusing notation of using $S_k$ for the household weight
and $S_t$ for the total weight for household type t is adopted to emphasize that the formulas
of Sections 2 and 3 apply equally well to households or household types. In doing calcula-
tions, the meaning will be clear from the context.)
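
The reduction to household types can be sketched as follows. Households are grouped by their composition vector, an adjustment routine is applied to the type totals, and each household then receives the ratio W_t/S_t of its type. The function names are hypothetical, and `overall_inflation` is only a stand-in adjustment (the "step 0" factor of the raking method) used to keep the example short; any of the constrained minimum distance routines could be passed in its place. The four households are invented for the illustration.

```python
from collections import defaultdict

def adjust_by_type(S, A, adjust_types):
    """Apply a type-level adjustment and spread it back to the households."""
    totals = defaultdict(float)
    for s, row in zip(S, A):
        totals[tuple(row)] += s  # S_t: total initial weight of each type
    types = sorted(totals)
    S_t = [totals[t] for t in types]
    W_t = adjust_types(S_t, [list(t) for t in types])
    ratio = {t: w / s for t, w, s in zip(types, W_t, S_t)}
    return [s * ratio[tuple(row)] for s, row in zip(S, A)]

def overall_inflation(N):
    """Stand-in adjustment: multiply every type total by N. / N-hat. ."""
    def adjust(S_t, A_t):
        n_hat = sum(s * sum(row) for s, row in zip(S_t, A_t))
        return [s * sum(N) / n_hat for s in S_t]
    return adjust

# Hypothetical sample: two one-female households and two female-male households.
S = [10.0, 20.0, 30.0, 40.0]
A = [[1, 0], [1, 0], [1, 1], [1, 1]]
N = [110.0, 77.0]

W = adjust_by_type(S, A, overall_inflation(N))
```

Here the four households collapse to two types, so the adjustment routine works on two rows instead of four.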

The reduction of the problem from individual households to household types is extremely
convenient for presenting small examples. Even when applied to the full 48 post-stratification
cells, the household-type approach may still be practical: despite the astronomical number
of possible household types, the actual number of types in the sample can never be larger
than the sample size and often is substantially smaller. This was found to be the case for
related cells of households in Ireland and Scheuren (1975). Simply reducing the size of the
computational task by combining the weights for single-person households of the same type
may be useful; this has been done at the U.S. Bureau of Labor Statistics in applying a
generalized least squares method to the Consumer Expenditure Surveys.

The simplified version of the principal person method also depends only on the household
type. If two households have the same composition, then their principal persons will be in
the same post-stratification cell, the one with the post-stratification factor closest to one.
Consequently, the same ratio adjustment factor would be used for both households. In the
actual principal person method, the principal person depends in part on who happens to
be designated as reference person, so the adjustment factor is not completely determined
by the household’s composition type.

Note that the MLE-H method corresponds to calculating multinomial maximum likelihood
estimates (subject to the constraint (1)) of $p_t$, t = 1,..., T, where $p_t$ is the population pro-
portion of households with type t. The MLE-P method has a related interpretation. Neither
of these models, which also pertain to the corresponding GLS and MDI methods, allows
for systematic undercoverage.


5. DISCUSSION OF THE METHODS


This section begins with some speculations about properties of the constrained minimum
distance methods, based on the results of Section 4, and follows with some simple hypothetical
examples, which generally appear to support the speculations.

The first conjecture is that MLE-H, GLS-H, and MDI-H will tend to give similar results,
and also that MLE-P, GLS-P, and MDI-P will tend to be similar to one another, at least 
for large samples. This is based on the observation that these are all best asymptotic normal 
estimators under the relevant multinomial sampling model, where the cells are the household 
types. For small or moderate sample sizes, greater differences between the methods might 
be anticipated, especially if there are a large number of household composition types, so 
that the sample in individual ‘‘cells’’ of the multinomial may be small. 

The examples given below tend to support this conjecture; the "household" methods all
give very similar results, as do the "person" methods. This is true even in some cases when
the hypothetical data do not fit the model very well. However, these examples involve only 
a small number of household types and post-stratification cells, and so are illustrative rather 
than conclusive. 

The second conjecture is based on considering the nature of the sampling models under 
which the constrained minimum distance methods may be viewed as maximum likelihood 
estimates, or asymptotic approximations thereto. In these models, perfect coverage is assumed. 
The models assume a distribution corresponding to probabilities which are the actual pro- 
portions in the population, and these probabilities are consistent with the ‘‘true’’ control 
totals used in the constraints (1). According to these models, for sufficiently large samples, 
the initial sample estimates would approach agreement with the control totals. This would 
not be true when there is substantial undercoverage in the sampling frame. Such undercoverage 
is an important reason for using post-stratification. Coverage considerations may be especially 
important for telephone surveys where there is no supplemental frame to include households 
without telephones. If there is no special adjustment for noninterview ‘‘nonresponse’’, such 
as refusal or inability to provide the requested information, then nonresponse may be a fur- 
ther departure. 

Based on these remarks, the second conjecture is that without adjustment the constrained 
minimum distance methods may not perform well in adjusting for systematic undercoverage, 
even for large samples. The methods are optimal under models which assume perfect coverage; 
one would expect that they might be less than optimal when this assumption is violated. 

The examples given below partly support this conjecture. The constrained distance methods 
do not do as well as the simplified principal person method under certain assumptions about 
undercoverage. Under other assumptions, some of the methods may do quite well. The author 
concludes that it is necessary to know more about the nature of survey undercoverage before 
judging that any of these methods is superior to the principal person method. Oh and Scheuren 
(1978b) raise some related issues about mean square error of the raking estimator when there 
is undercoverage. 

Two examples will be presented, representing two extreme forms of undercoverage. The
first ("household undercoverage example") will assume that there is a uniform 10% under-
coverage of all households, but that there is no within-household undercoverage. The se-
cond example ("within-household undercoverage example") assumes a 10% undercoverage
of males due to within-household undercoverage in households where there are both males
and females, and undercoverage of all-male households. For single-person households, any
"within-household undercoverage" means that the whole household is missed.

In example 1, there is a 10% under-representation of all types of households in the sam- 
ple. For a sufficiently large sample, this would obviously be due to systematic undercoverage, 
rather than sampling error. Applying the constrained minimum distance methods and the 
principal person method to this example gives the total adjusted weights for each household 
type shown in the last four columns of Table 1. 

Note that the GLS-P, MDI-P, and MLE-P methods all bring the adjusted weight up to 
the actual population value. Thus, these methods give "unbiased" weights. Since all per-
sons have a second-stage factor of 1/0.9, the principal person method also achieves this result.


Table 1


Household Undercoverage Example:
Description of Population and Sample


                                           Total Weight (W_t) for Methods:

                                                                            GLS-P,
                                                                            MDI-P,
                             Total                                          MLE-P,
Type &          Actual       Initial                                        Princ.
description     Population   Weights      GLS-H      MDI-H      MLE-H      Pers.

1: F             25,000       22,500     23,785     23,745     23,704     25,000
2: M             15,000       13,500     14,120     14,097     14,075     15,000
3: FF             7,000        6,300      7,020      7,016      7,013      7,000
4: FM            40,000       36,000     39,708     39,672     39,632     40,000
5: MM             5,000        4,500      4,913      4,906      4,900      5,000
6: FFM           12,000       10,800     12,529     12,506     12,594     12,000
7: FMM           12,000       10,800     12,408     12,428     12,449     12,000
8: FFF                0            0          0          0          0          0
9: MMM                0            0          0          0          0          0

Total           116,000      104,400    114,483    114,370    114,367    116,000

Control Totals:                   Number of Females = 115,000
                                  Number of Males   = 101,000
Initial Weighted Person Counts:   Females           = 103,500
                                  Males             =  90,900

The other methods, GLS-H, MDI-H, and MLE-H, all give substantially too little weight
to one-person households and too much to the three-person households. Intuitively, this makes
sense; since these methods do not allow for systematic undercoverage and must explain the
shortage of sample persons as sampling error, the obvious explanation is that the sample
has a below-average number of large households, due to chance. The better performance
of MLE-P makes some sense, since it starts out with a multinomial sampling model which
allows sampling of persons without regard to households.

Practically speaking, this example reflects very poorly on the GLS-H, MDI-H, and MLE-H
methods. Even uniform undercoverage would cause these methods to distort the distribution
of household sizes. Worse, the distortion goes opposite from what is commonly assumed
about differential household coverage, namely that small households are more likely to
be missed than large ones, so that small households need relatively higher weights, not
relatively lower weights.

The second example will emphasize within-household undercoverage of males. The situation
is more complicated than in the previous example, because a household may have an
apparent composition type different than its actual type. For example, a household which
actually consists of a male and a female may appear to be a single-person household. The
actual and apparent type will be indicated by modifying our previous notation. For example,
an FM household in which the male is missed will be denoted F[M]. An [M] household
or [MM] household is missed entirely. Table 2 describes the hypothetical data. The actual
population is the same as in the previous example.


Table 2


Within-household Undercoverage Example:
Description of Population and Sample


                                             Total
Actual        Apparent       Actual          Initial
Type          Type           Number          Weights

1: F          F              25,000          25,000
2: M          M              13,500          13,500
              [M]             1,500               0
3: FF         FF              7,000           7,000
4: FM         FM             36,000          36,000
              F[M]            4,000           4,000
5: MM         MM              4,500           4,500
              [MM]              500               0
6: FFM        FFM            10,800          10,800
              FF[M]           1,200           1,200
7: FMM        FMM            10,800          10,800
              FM[M]           1,200           1,200
8: FFF        FFF                 0               0
9: MMM        MMM                 0               0

Total                       116,000         114,000

Control Counts:                   Number of Females   115,000
                                  Number of Males     101,000
Initial Weighted Person Counts:   Females             115,000
                                  Males                90,900

Note that there is a 10% undercoverage of males, due to missing males within households,
or missing all-male households. Each male has a 10% chance of being missed.

Neither column of numbers in Table 2 is observed, since there are no household controls.
Also the actual household type is not known for the sample units. Thus, the F[M] households
appear to be the same as the F households. The data which would be observed are given
in Table 3, along with the total initial weight for households which appear to have a given
type. The adjusted weights are given for three methods, MLE-H, MLE-P, and principal person.
The results for GLS-H and MDI-H are fairly close to MLE-H, and GLS-P and MDI-P
are similar to MLE-P, so these other methods are omitted.

The last three columns of Table 3 show the total adjusted weight assigned to each actual
household type by the MLE-H, MLE-P, and principal person methods. The principal per-
son weights for each actual household type agree with the population counts for the actual
types, shown in the third column of Table 1. In this sense, the principal person weights are
unbiased.

This example corresponds to assumptions upon which the simplified principal person method is
based. The principal person adjusted weights for each actual type of household coincide with
the population counts. The one difference is that totally missing [M] or [MM] households
are given no weight; however, the weight of the non-missing M or MM households is in-
creased accordingly. The total weighted number of households for the principal person method
is equal to the number in the population.


Table 3


Within-household Undercoverage Example: Observed Types and Weights,
with Adjusted Weights from Three Methods


                           Weight Assigned to                Weight Assigned to
                           Apparent Type                     Actual Type
             Total
Household    Initial                         Principal                         Principal
Type         Weight      MLE-H     MLE-P     Person        MLE-H     MLE-P     Person

F            29,000     27,450    26,973    29,000        23,664    23,253    25,000
M            13,500     14,997    16,338    15,000        14,997    16,338    15,000
FF            8,200      7,368     7,626     8,200         6,290     6,510     7,000
FM           37,200     38,887    39,128    37,200        41,419    41,586    40,000
MM            4,500      5,623     5,446     5,000         5,623     5,446     5,000
FFM          10,800     10,661    10,885    10,800        11,739    12,001    12,000
FMM          10,800     12,605    11,878    10,800        13,859    13,140    12,000
FFF               0          0         0         0             0         0         0
MMM               0          0         0         0             0         0         0

Total       114,000    117,591   118,274   116,000       117,591   118,274   116,000


In this example, the constrained minimum distance methods overestimate the total number
of households, but give too little weight to the households without males. In general, too
much weight is given to households with males.
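
The step from the apparent-type weights to the actual-type weights of Table 3 is simple arithmetic, sketched below for MLE-H. Each group of households in Table 2 receives the adjustment ratio of the type it appears to be, and the results are summed by actual type. The names are hypothetical; the MLE-H weight for apparent type MM is taken as 5,623, an assumption based on the male control total of 101,000.

```python
# Table 2 rows: (actual type, apparent type, total initial weight).
# Households missed entirely ([M] and [MM]) carry no weight and are omitted.
rows = [
    ("F", "F", 25000), ("M", "M", 13500), ("FF", "FF", 7000),
    ("FM", "FM", 36000), ("FM", "F", 4000), ("MM", "MM", 4500),
    ("FFM", "FFM", 10800), ("FFM", "FF", 1200),
    ("FMM", "FMM", 10800), ("FMM", "FM", 1200),
]

# MLE-H adjusted weights by apparent type (Table 3).
adjusted_mle_h = {"F": 27450, "M": 14997, "FF": 7368, "FM": 38887,
                  "MM": 5623, "FFM": 10661, "FMM": 12605}

def weight_by_actual_type(rows, adjusted):
    """Total adjusted weight per actual type, using apparent-type ratios."""
    initial = {}
    for _, apparent, s in rows:
        initial[apparent] = initial.get(apparent, 0) + s
    actual = {}
    for actual_type, apparent, s in rows:
        ratio = adjusted[apparent] / initial[apparent]
        actual[actual_type] = actual.get(actual_type, 0.0) + s * ratio
    return actual

actual_mle_h = weight_by_actual_type(rows, adjusted_mle_h)
```

For instance, the F households are scaled by 27,450/29,000, so the 25,000 actual F households receive about 23,664, while the 4,000 F[M] households scaled by the same ratio are counted with the actual FM households.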

It should not be concluded that the principal person method always outperforms the con-
strained minimum distance methods when there is within-household undercoverage. Under
other assumptions about coverage, the principal person method may not do so well. In fact,
different versions of the principal person method are used for different surveys, based on
various assumptions about coverage. Note also that combinations of the principal person
method and raking methods are possible; see Scheuren (1981).

Even in this example, the biased weights assigned by the constrained minimum distance
methods could be beneficial for estimating some characteristics. If the households in which
males are missed tend to under-report the variable of interest, then giving these households
too high a weight may tend to counteract response bias associated with the within-household
undercoverage.

The most extreme example of this effect is estimation of the total number of males, in
which case the MLE-H and MLE-P weights give estimates which agree with the control totals,
while the principal person weights do not. However, for household characteristics where there
would rarely be reporting errors because of the missed male, such as form of tenure
(renter/owner), the biased weights would not be desirable. The performance of the weighting
methods in situations like these clearly depends on the nature of the survey undercoverage
and its relationship to the variable being estimated. This is discussed further, with additional
examples, in Alexander and Roebuck (1986).

Pending further research on survey coverage and its effect on weighting, what recommen-
dations can be made? Among the constrained minimum distance methods considered in this
paper, GLS-H, MDI-H, and MLE-H seem unattractive because of their failure to adjust
correctly for uniform undercoverage of households. This is in spite of the fact that, if there
were no undercoverage, MLE-H seems to be based on a more sensible model than MLE-P,
since households rather than persons are the ultimate sampling unit.


The possibility of negative weights raises questions about the appropriateness of GLS-P,
even though in some practical applications (such as Zieschang 1986b) there are very few 
negative weights, so that they could be replaced by positive weights with little effect on the 
estimates. That leaves MDI-P and MLE-P. Our results give little basis for choosing between 
these methods. Computational considerations tend to favor the "raking" method MDI-P.
Based on limited experience with the algorithms of Section 3, the MLE methods converge
more slowly than the MDI methods. Further, there has been considerable research into ways
to improve the efficiency of raking for large-scale applications, such as Ireland and Scheuren 
(1975). Taking all this into account, the raking method, MDI-P, seems to be the most pro- 
mising of the constrained minimum distance methods. 

The constrained minimum distance methods give household weights which are consistent
with control totals for persons, unlike the principal person method. However, the superiority
of the constrained minimum distance methods over the principal person method as an ad-
justment for undercoverage is far from obvious. Undercoverage is an essential part of the
survey weighting problem. The principal person method is an ad hoc solution to the under-
coverage problem, based on some very simplistic assumptions about coverage. However, as
seen in Section 4, the constrained minimum distance methods may be viewed as "optimal"
(i.e., maximum likelihood or the asymptotic equivalent) estimators under models which assume
perfect coverage. The choice is thus between an optimal solution to the wrong problem and
an ad hoc solution to what may or may not be the right problem. Clearly more research
is needed.




6. SOME AREAS FOR FURTHER RESEARCH 


6.1 Household Control Totals 


If independent estimates of the number of households of different kinds were available, 
then ordinary post-stratification could be used for household estimates. Household controls 
by size of household are being investigated, based on updating 1980 census results (Das Gupta 
et al. 1986). The availability of household controls would fundamentally change our ability 
to deal with the household weighting problem. 

Even with household controls, it might be beneficial to also incorporate person controls. 
The household controls are not likely to include detailed information on the age, race, and 
sex of the household members. The use of raking to simultaneously control the estimates 
to independent controls for persons and households is developed by Scheuren (1981), using 
an estimate of the total number of households. Zieschang (1986a) describes how similar ad- 
justments may be made using generalized least squares. 

Household controls clearly have great potential for adjusting for differential coverage of 
various types of households. There still may be problems in dealing with within-household
undercoverage, since this may lead to errors in determining the true household size, which 
would cause sample households to be placed in the wrong post-stratification cell. 


6.2 Research Concerning Coverage 


Coverage of persons is measured fairly well by comparing the initial survey estimates $\hat{N}_j$
to the control totals $N_j$. It is difficult to determine how much of this undercoverage is due
to missing entire households and how much is due to missed persons within households. Ad-
ditional information could be obtained by comparing initial weighted household estimates
to household controls, once these controls become available. In the meantime, 1980 survey
estimates by type of household could be compared to the corresponding 1980 census counts.

Even with this additional information, it is not possible to completely distinguish household
undercoverage from within-household undercoverage, without making additional assump-
tions. Alexander and Roebuck (1986) present some preliminary suggestions about how a range
of coverage models might be fit to census and survey data. An alternative approach would
be to include coverage parameters in a multinomial sampling model such as those described
for the MLE-H or MLE-P weighting methods. Other approaches to modelling coverage are
presented in Wolter (1986).


6.3 Estimation of Variances 


Methods for estimating variances of the weighted estimators have not been investigated
for most of the constrained minimum distance methods. For raking estimators, some methods
are available; see Arora and Brackstone (1977), Bankier (1978) and Fan et al. (1981).

For any of the methods, replication methods for estimating the variance could be applied.
These methods have been shown to give reasonable results under fairly general conditions;
see for example Krewski and Rao (1985). It remains to be determined whether these condi-
tions can be applied to the constrained minimum distance methods.


6.4 Computational Issues 


Zieschang (1986b) has applied the generalized least squares methods to the U.S. Consumer
Expenditure Surveys. Scheuren (1981) describes a large-scale application of the raking method
to household weighting. The maximum likelihood constrained minimum distance algorithms
(MLE-H and MLE-P) have not been tried on large-scale problems of this kind. If they were
to be used in actual survey weighting, research may be needed to improve their computa-
tional efficiency.
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An Integrated Method for Weighting 
Persons and Families 


G. LEMAITRE and J. DUFOUR¹


ABSTRACT 


Household surveys generally use separate procedures for estimating characteristics of persons and those
of families. An integrated procedure is proposed and a least-squares estimator introduced to achieve
this end. The estimator is shown to be unbiased under certain general conditions. Using data from
the Canadian Labour Force Survey, variances for the estimator are calculated and shown to compare
favourably to those from current procedures.


KEY WORDS: Family estimation; Family weighting; Least-squares weighting.


1. INTRODUCTION 


It is customary for many household surveys to incorporate in their estimation procedures
a post-stratification step in which the design-based estimates of the population, generally
by age and sex group, are benchmarked to independent totals obtained from demographic
sources. In practice, for ease of tabulation, a weight is normally associated with each
responding person, equal to the product of the inverse sampling rate, an adjustment for
non-response, and an age/sex ratio adjustment factor. Estimates for a particular characteristic
are then obtained by summing up the weights of all responding persons in the sample bearing
that characteristic. Because of the age/sex adjustment factors, the weight so assigned will
usually differ from person to person within the same household. When estimating characteristics
of persons, this may not pose any particular problem; in producing estimates of households
or families, however, it is not entirely clear which weight is the appropriate one to use, if any.
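As a rough illustration of the person weighting just described, the following Python sketch forms a final weight as the product of the three factors and shows that members of one household can end up with different weights. All values, array names and control totals are hypothetical and are not taken from any survey.

```python
import numpy as np

# Hypothetical toy sample: one entry per responding person.
design_weight = np.array([100.0, 100.0, 100.0, 120.0, 120.0, 120.0])  # inverse sampling rate
nonresp_adj = np.array([1.10, 1.10, 1.10, 1.05, 1.05, 1.05])          # non-response adjustment
age_sex_group = np.array([0, 1, 1, 0, 1, 1])                          # post-stratum of each person
household = np.array([1, 1, 1, 2, 2, 3])                              # household identifier
control_total = np.array([260.0, 420.0])                              # assumed benchmark totals

# Weight before post-stratification.
sub_weight = design_weight * nonresp_adj

# Age/sex ratio adjustment: benchmark total over the weighted sample estimate.
group_estimate = np.bincount(age_sex_group, weights=sub_weight)
ratio_adj = control_total / group_estimate

# Final person weight is the product of the three factors.
final_weight = sub_weight * ratio_adj[age_sex_group]

# Persons in household 1 belong to different age/sex groups,
# so their final weights differ.
print(final_weight[household == 1])
```

The weighted sample counts reproduce the benchmark totals by construction, but no single weight is available for the household as a unit.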

To estimate family characteristics, one might well elect to carry out a ratio estimation
step using auxiliary information on families as well as persons. However, reliable and timely
auxiliary counts of families that could be used in ratio estimation are in general not available.
As a result of events such as births, deaths, marriages, divorces and persons leaving or
entering a household, characteristics such as family size change from one census to the next, in
ways that are less predictable than a characteristic such as age. The administrative records
that are the main source of information on post-censal population change (i.e. birth, death
and migration records) do not provide information on household-related change. Birth
records, for example, do not provide information on the size of a family into which a child
is born. Tax records can compensate in part for this deficiency (see Auger 1987); however,
such records do not cover the entire population nor are they available in a timely enough
fashion to be used in producing current estimates. In the absence of auxiliary counts of
families, household surveys generally have adapted the weights obtained from
‘‘person-weighting’’ for use in estimating characteristics of families. For various reasons this is a
somewhat less than ideal solution. The present paper proposes a method of estimation that
results in a single uniquely defined weight per household which would be appropriate for
both individual and family estimation.


¹ G. Lemaitre and J. Dufour, Social Survey Methods Division, Statistics Canada, 4th Floor, Jean Talon Building,
Tunney’s Pasture, Ottawa, Ontario, K1A 0T6.

Techniques to achieve a single household weight have been proposed in the past, with
an emphasis on using auxiliary information on persons to improve estimates of families.
Oh and Scheuren (1978) proposed a method of ‘‘multivariate raking’’ which consists of
successively ratio adjusting population estimates by post-stratum by means of the ratio
adjustments calculated for each post-stratum in turn, and then iterating to convergence. The
adjustments at each stage are applied to households containing persons in the particular
post-stratum being adjusted for. Zieschang (1986) adopted a Generalized Least Squares (GLS)
approach in which the sum of weighted squared adjustments to the design weights was
minimized, subject to a set of linear constraints. Alexander (1987) examines several constrained
minimum distance weighting methods, including the GLS method, and evaluates them in
the context of survey undercoverage. Although the above methods were originally proposed
as ways of improving estimates of families, the survey weights derived from the various
estimators can clearly be used to estimate characteristics of persons as well. This paper argues
in favour of adopting such an integrated approach to individual and family estimation.
Section 2 discusses the limitations of the current approaches to estimating characteristics of
persons and families. Section 3 introduces a model-based estimator adapted from a generalized
weighting procedure due to Bethlehem and Keller (1987). Section 4 presents some empirical
results taken from the Canadian Labour Force Survey. Section 5 discusses plans for further
study.


2. CURRENT ESTIMATION PROCEDURES 


The principal mandate of most household surveys traditionally has been to produce
estimates for characteristics of persons, particularly of labour force characteristics. Such
surveys adopt the household as the ultimate sampled unit essentially for reasons of cost and
convenience. Although the household unit is normally respected in preliminary weighting
steps (non-response adjustments, rural/urban adjustments, etc.), it is generally ignored in
the final weighting step, i.e. no allowance is made for the fact that the members of a household
are sampled as a unit. In particular, any coverage biases associated with the sampled unit
are not directly taken into account or compensated for in estimation. Undercoverage is thus
assumed to be ignorable in the sense of Rubin (1976); every person in an age/sex post-stratum
is treated the same in estimation whether he/she is living alone or comes from a multi-person
household. One study of non-response in the Labour Force Survey (Paul and Lawes 198[…]),
however, has demonstrated that smaller households, particularly households without children,
tend to be underrepresented in the sample. Although no comparable studies exist for missed
households in the Labour Force Survey, studies of private household undercoverage in the
census have shown that non-enumerated households are indeed smaller on average than
enumerated households (Gosselin and Théroux 1980). A missing-at-random type procedure
can lead to biases in labour force estimates for persons, particularly if the labour force
distribution of persons in smaller households is different from that of persons in larger ones, all
things being equal. Intuitively, an estimation procedure which takes into account (even if
only indirectly) the fact that smaller households are more subject to non-response and
undercoverage than larger ones could correct in part for this deficiency in the sample.

In the absence of auxiliary information on households or families that could be
incorporated into an appropriate weighting procedure to produce a well-defined family weight,
many current methods adopt as the family weight the weight of a ‘‘principal person’’ in the
family. In the Canadian Labour Force Survey, this person is the female spouse if present,
otherwise the head. Since such methods do not take household composition into account,
family estimates generated using this weight tend to overestimate larger families and to
underestimate unattached persons. In addition many characteristics (e.g., population,
income) can be estimated using either the individual weight or the family weight, and the
estimates will in general disagree, sometimes substantially. Of course even under ideal sampling
and interviewing conditions, with no differential non-response or undercoverage, family and
individual-based estimates of the same characteristic will disagree somewhat. With a large
enough sample, however, the discrepancies should be small. Under actual, i.e., less than ideal
conditions, differences may be too large to explain away by a facile appeal to sampling
variability. An estimation procedure that yields a single household weight which, when used
as an individual weight, respects the auxiliary population totals will eliminate the
awkwardness of having two estimation systems. It is these deficiencies that the estimator described
in the following section was designed to deal with.


3. A PROPOSED ESTIMATOR 


We begin by introducing a generalized weighting procedure based on linear models due
to Bethlehem and Keller (1987) and applying it first to person-based estimation as was done
in their paper. A modification of the procedure is introduced which leads to household weights
appropriate for estimating characteristics of persons. We will borrow freely from their original
presentation in what follows.

Assume a survey target population consisting of N units, an N-vector Y of values of a
target variable, and an N by p matrix X of auxiliary variables defined for each unit of the
target population. The population totals for each auxiliary variable are assumed to be known
and will be denoted collectively by the p-vector x. In our application x will consist of age-sex
totals. If the auxiliary variables are correlated with the target variable, then for an appropriate
p-vector B, the values of $E = Y - XB$ will vary less than the values of the target variable
Y. Ordinary least squares on all units of the target population yields

$$B = (X'X)^{-1}X'Y \qquad (3.1)$$

provided X is of full rank. A sample-based estimate for B is given by

$$\hat{B} = (X'\Pi^{-1}TX)^{-1}X'\Pi^{-1}TY \qquad (3.2)$$

where T is a diagonal matrix whose i-th element is 1 if the i-th unit of the population is in
the sample, 0 otherwise, and $E(T) = \Pi$.

It can be shown that for large samples $\hat{B}$ will be approximately unbiased. The parameter
of interest, however, is not B but the population total y. If we define $\hat{y} = \hat{B}'x$, $\hat{y}$ will be
an approximately unbiased estimator of y provided that $B'x = y$, or equivalently, provided
the sum of the residuals for the population model $Y = XB + E$ is equal to zero. This will
hold if the N-vector whose elements consist of ones is in the space spanned by the columns
of X, and in particular, if the auxiliary variables X include an exhaustive and mutually
exclusive set of indicator variables (for age/sex groups, for example).

If we write $\hat{y} = \hat{B}'x = Y'\Pi^{-1}TX(X'\Pi^{-1}TX)^{-1}x$, we see that the estimator implicitly
defines an N-vector of weights given by

$$W = \Pi^{-1}TX(X'\Pi^{-1}TX)^{-1}x$$

that do not depend on the particular target variable being estimated. If these weights are
used to produce sample estimates for the auxiliary variable characteristics, we then have
$X'W = x$, so that the weights do indeed yield the appropriate population totals. Furthermore,
if X consists exclusively of an exhaustive and mutually exclusive set of indicator variables,
then the regression estimator $\hat{y}$ will be equivalent to the ordinary post-stratification estimator.
For further details, see Bethlehem and Keller (1987).
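The properties of these implicit weights can be checked numerically. The sketch below is only an illustration under assumed inputs: the sample, the inclusion probabilities `pi`, the totals `x_tot` and the target values `y` are all hypothetical, and the matrices are restricted to sampled units (which is what multiplying by T achieves).

```python
import numpy as np

# Hypothetical sample of 8 persons in 3 age/sex groups (indicator columns).
group = np.array([0, 0, 0, 1, 1, 2, 2, 2])
X = np.eye(3)[group]
pi = np.array([0.01, 0.02, 0.01, 0.02, 0.02, 0.01, 0.01, 0.02])  # assumed inclusion probabilities
x_tot = np.array([350.0, 180.0, 290.0])                          # assumed known population totals
y = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0])           # assumed target variable

d = 1.0 / pi                              # diagonal of Pi^{-1} for sampled units
A = X.T @ (X * d[:, None])                # X' Pi^{-1} T X
w = d * (X @ np.linalg.solve(A, x_tot))   # W = Pi^{-1} T X (X' Pi^{-1} T X)^{-1} x

# Regression estimator computed directly from B-hat, for comparison.
B_hat = np.linalg.solve(A, X.T @ (d * y))
y_hat = B_hat @ x_tot

# Ordinary post-stratification weights for the same groups.
d_sum = np.bincount(group, weights=d)
w_poststrat = d * x_tot[group] / d_sum[group]

print(X.T @ w)        # reproduces x_tot
print(w @ y, y_hat)   # the two forms of the estimator agree
```

With indicator variables only, the regression weights coincide with the post-stratification weights, as stated in the text.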

The weight of an arbitrary sample person i under this procedure can be expressed generally as

$$W_i = \sum_{j=1}^{p} \frac{X_{ij} b_j}{\pi_i} \qquad (3.3)$$

where $(b_1, \ldots, b_p) = (X'\Pi^{-1}TX)^{-1}x$ and $\pi_i$ is the inclusion probability for person i. This
suggests that the estimation method described above can be adapted to yield the desired weights
by defining the auxiliary variables in the same way for all household members. An obvious
way to do this is to define auxiliary variables at the household level, for example by
replacing the corresponding variables defined at the person level by the household mean. More
formally let Z be an N by p matrix defined for person i (i = 1, ..., N) belonging to household
h by

$$Z_{ij} = \frac{U_{hj}}{n_h}, \qquad j = 1, \ldots, p,$$

where $U_{hj}$ is the total for characteristic j in household h, i.e. $U_{hj} = \sum_k X_{kj}$, with the
summation being over all members k of household h, $n_h$ = size of household h, and $\sum_h n_h = N$.
Let Y again be an N-vector of values for an arbitrary target variable defined on persons.
As in person-level estimation, we work with the population model $Y = ZC + E$ and apply
least squares to the sample data to obtain an estimate

$$\hat{C} = (Z'\Pi^{-1}TZ)^{-1}Z'\Pi^{-1}TY. \qquad (3.4)$$

We define $\hat{y} = \hat{C}'x$ where x is again the vector of population totals for the auxiliary variables.
$\hat{y}$ will be an approximately unbiased estimator of y provided the N-vector of ones is in the
space spanned by the columns of Z. In a manner analogous to (3.3), the weight for an
arbitrary sampled person in household h will be given by

$$W_h = \sum_{j=1}^{p} \frac{U_{hj} c_j}{n_h \pi_h}. \qquad (3.5)$$


Since each household member contributes the same row vector to Z and since each has the
same first order inclusion probability, each person within a household will have the same
weight. Furthermore the use of the household weight as a person weight yields the correct
auxiliary population totals. Although it is possible to obtain negative weights under this procedure
(if some of the $c_j$’s are less than zero), for well-behaved samples (i.e., not subject to serious
non-response or undercoverage) households whose weights are changed substantially by this
procedure tend to be households of unusual composition that are uncommon in the sample
and in the population at large. Recently in weighting twenty-four months of Labour Force
Survey data under this procedure, only one household had a (small) negative weight attributed
to it. Negative weights are problematic because it is difficult to attach the usual meaning
one assigns to weights, that is, the number of persons/households in the population at large
represented by a particular sampled person/household. However, under the formulation
described above, the final weights are defined only implicitly and indeed could be viewed
as merely a convenient means of generating estimates. In practice even with some negative
weights, it is unlikely that a meaningful estimate of level for a characteristic of interest would
turn out negative. The problem of explaining a negative weight to a mystified user is of course
a different question.
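The two properties claimed above (one weight per household, and correct auxiliary totals when that weight is used for persons) can be illustrated with a small numerical sketch. The households, inclusion probabilities and totals below are hypothetical, and the coefficients are taken as $(Z'\Pi^{-1}TZ)^{-1}x$ by analogy with (3.3), which is an assumption about notation rather than a statement from the paper.

```python
import numpy as np

# Hypothetical sample: 10 persons in 4 households, 3 age/sex groups.
household = np.array([0, 0, 0, 1, 1, 2, 3, 3, 3, 3])
group = np.array([0, 1, 2, 0, 1, 1, 0, 1, 2, 2])
X = np.eye(3)[group]                        # person-level indicator variables
pi_h = np.array([0.01, 0.02, 0.01, 0.02])   # assumed household inclusion probabilities
pi = pi_h[household]                        # same probability for all members
x_tot = np.array([400.0, 450.0, 380.0])     # assumed known age/sex totals

# Household totals U_hj and household sizes n_h.
n_h = np.bincount(household)
U = np.zeros((n_h.size, X.shape[1]))
np.add.at(U, household, X)

# Z: each person's row is replaced by the household mean U_hj / n_h.
Z = (U / n_h[:, None])[household]

# Coefficients and weights, following the pattern of (3.3) with Z in place of X.
A = Z.T @ (Z / pi[:, None])                 # Z' Pi^{-1} T Z
c = np.linalg.solve(A, x_tot)
w = (Z @ c) / pi                            # one value per person, constant within household

print(w)          # identical weights inside each household
print(X.T @ w)    # person-level totals equal x_tot
```

Nothing in this construction prevents an individual weight from being negative; whether that happens depends on the sample composition.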

The variance of the estimator $\hat{y} = \hat{C}'x$ described in this paper can be obtained using
methods described in Fuller (1975). In addition the estimator can be shown to be equivalent
to the GLS estimators proposed by Zieschang (1986) and Alexander (1987) when the space
spanned by the auxiliary variables Z contains a vector of ones. Further properties of this
type of estimator can be found in Wright (1983).


4. EMPIRICAL RESULTS 


The Canadian Labour Force Survey is a monthly rotating panel survey of approximately 
48,000 households across Canada (see Platek and Singh 1976 and Singh, Drew, and Choudhry 
1984). Households once selected remain in the sample for six consecutive months before being 
replaced. The primary geographic strata are the ten provinces. Sample sizes vary from a low 
of 1500 households in Prince Edward Island, the smallest province, to about 9000 households 
in Ontario, the most populous one. The survey collects data concerning the labour market 
situation of respondents during a reference week each month and publishes a wide variety 
of estimates related to the nation’s labour supply. 

A preliminary evaluation of the estimator described above was carried out using data from 
one of the monthly surveys. May 1981 was chosen to permit comparisons to results from 
the 1981 census held at about that time. Although we have been using the terms ‘‘household’’
and ‘‘family’’ interchangeably up to now, user interest is often focused on estimates of
‘‘economic families’’, which consist of all persons in a household related by blood,
marriage, or adoption. For weighting purposes it is conceptually more appealing to deal with
the actual sampled unit, i.e. the household. However, the empirical results presented here 
will be based on estimates for economic families. The evaluation carried out focused on both 
characteristics of persons (labour force status) and of families (number of economic families 
and number of unattached persons). The least-squares weighting was carried out for two 
sets of five-year age/sex groups, with persons seventy and over being grouped according to 
sex. The first set of (twenty-four) age/sex groups excluded children 0 to 14 years of age from 
the weighting, to permit a comparison to a standard person-based post-stratification estimator 
using the same auxiliary information. The second set included children grouped into six age/sex 
groups and was used only for least-squares weighting, since under standard post-stratification 
the weighting of children would have no effect on the weighting of persons 15 and over. 

Although all estimators considered are approximately unbiased for estimates of
characteristics of persons, each makes different assumptions about the nature of
undercoverage and non-response. (The Labour Force Survey’s non-response adjustment procedure
assumes that non-responding households are missing at random within geographic area.) The
post-stratification estimator implicitly assumes that any differential non-response and
undercoverage depends only on age and sex and is therefore adequately compensated for by
person-based estimation using auxiliary information on these characteristics. Under
least-squares weighting, the weight of a person will depend on the age/sex composition of the
household (without children in one case, with children in the other). Thus, all things being
equal, one would expect the design weight of a person belonging to an age/sex group subject
to substantial undercoverage to be adjusted less if that person is living with persons
belonging to age/sex groups well covered by the sample than if he/she is living alone.

Since the auxiliary population totals by age and sex are available by province, estimation
was carried out separately for each province. However, the smaller provinces have been
collapsed into two groups in the following tables.

In general the three estimators do not yield substantially different estimates, particularly
A and B. The inclusion of children in the weighting does appear to lead to slightly higher
estimates of employment and of unattached persons and slightly lower estimates of economic
families nationally and in the larger provinces (compare results from Scheuren et al. 1981).
This is in line with expectations, although there is still some ground to cover vis-a-vis census
results, which show (rounded to thousands) 6,369,000 economic families and 2,583,000
unattached persons at the national level. The moral of the tale is that, although the least-squares
estimator does take us part of the way home (when the presence of children is taken into
account), it will require accurate and timely auxiliary information to eliminate the residual bias.


Table 1


Number of Persons Employed and Unemployed, Number of Economic Families and
Unattached Persons, Labour Force Survey, May 1981 (In Thousands)


                                                            Economic      Unattached
                   Estimatorᵃ   Employed      Unemployed    Families      Persons

Canada             A            11,094        850           6,424         2,432
                   B            11,090        850           6,446         2,442
                   C            11,120        851           6,410         2,495
Atlantic Region    A            819           102           563           156
                   B            819           102           570           154
                   C            821           102           569           156
Quebec             A            [illegible]   304           1,728         587
                   B            2,724         304           1,725         596
                   C            [illegible]   305           1,714         614
Ontario            A            4,198         274           [illegible]   863
                   B            4,200         273           2,325         861
                   C            4,211         273           2,310         881
Prairie Region     A            2,074         83            1,078         506
                   B            2,072         84            1,089         510
                   C            2,074         83            1,085         517
British Columbia   A            1,277         88            735           319
                   B            1,276         88            738           321
                   C            1,280         88            734           327

ᵃ A = post-stratification/principal person, B = least squares with children excluded from weighting and C = least
squares with children included in weighting.

The expected performance of the least-squares estimator with regard to efficiency is not
altogether obvious. Certainly, if one were to base a prediction on the results observed above,
then the similarity of the estimates to those produced by the post-stratification estimator
would lead one to expect it to perform as well as the latter. On the other hand, one might
expect efficiency gains for estimates of economic families, because of the fact that the
least-squares estimator makes use of the auxiliary population totals in determining the household
weight. However, a single weight per household is not achieved without some redistribution
of weights at the micro level.


Table 2


Distribution of Percent Deviations of Final Weights Relative
to the Design Weights, Labour Force Survey, May 1981


                             Percentage of Total Sample

Percent                                                     Least-Squares
Deviation        Post-Stratification     Least-Squares      (With Children)

< -30%           0.0                     0.1                0.2
-30 to -20%      0.0                     0.5                0.9
-20 to -10%      0.6                     3.2                5.3
-10 to 0%        23.9                    20.4               [illegible]
0 to 10%         53.9                    44.6               37.8
10 to 20%        20.6                    26.3               21.6
20 to 30%        0.6                     4.4                6.2
30 to 40%        0.1                     0.4                0.9
40 to 50%        0.0                     0.0                0.2
> 50%            0.0                     0.0                0.2


Note: Sample size is N = 159,014.

Table 3


Estimated Efficiencies of Least-Squares Estimators Relative to
Post-Stratification Estimator, Labour Force Survey, May 1981


                                                            Economic      Unattached
                   Estimatorᵃ   Employed      Unemployed    Families      Persons

Canada             B            1.044         0.999         1.565         1.038
                   C            1.066         0.999         1.616         1.036
Atlantic Region    B            1.110         0.977         1.266         0.998
                   C            1.193         0.992         [illegible]   1.070
Quebec             B            1.059         1.005         1.553         1.020
                   C            1.063         0.992         1.582         0.992
Ontario            B            1.028         1.011         1.825         1.064
                   C            1.059         1.010         1.828         1.037
Prairie Region     B            1.001         1.009         1.205         1.009
                   C            1.072         1.066         1.420         1.134
British Columbia   B            1.038         0.964         1.248         1.048
                   C            1.053         0.978         1.203         1.045

ᵃ B = least squares with children excluded from weighting and C = least squares with children included in weighting.

As Table 2 illustrates, the least-squares weights have a somewhat greater dispersion than
those based on standard post-stratification methods. Including children in the weighting
results in an even greater dispersion. The movement in the weights essentially reflects the
extent to which the age/sex household size composition of the sample fails to mirror that
existing in the general population. Since the objective of a single weight per household
imposes an additional constraint on the estimation procedure, one might expect variances
to suffer somewhat, particularly if no additional auxiliary information is brought to bear
in estimation.

Variances for the post-stratification estimator were estimated using the Keyfitz method 
(1957) with PSU’s (primary sampling units) or collapsed PSU’s as replicates. The least-squares 
variances were estimated using the method described in Fuller (1975). To ensure comparability, 
variances for several characteristics estimated by means of post-stratification were calculated 
using the Fuller technique and compared to those from the Keyfitz approach. In all cases 
the two sets of variance estimates were very close (within one or two percent). 

Table 3 summarizes the estimated efficiencies of the least-squares estimators relative to 
post-stratification for the characteristics considered in Table 1. The efficiency gains for 
estimates of economic families are substantial. Estimates of persons employed and of unat- 
tached persons also appear to gain somewhat; however, the variance reductions for these 
characteristics are small, with the exception of employed in the Atlantic Region, particularly 
when children are included in the weighting. Interestingly, average family sizes in the Atlantic
Region are higher than in the rest of the country, although it is not clear how this would
affect estimates of employed persons. The variances for the characteristic unemployed are 
essentially unaffected by the least-squares procedure. One can probably expect these results 
to hold in general, i.e. for arbitrary characteristics. Although the one-weight-per-household 
criterion is a restrictive one for estimates of characteristics of persons, the least-squares 
estimators appear to compensate through the additional ‘‘explanatory”’ variables of the linear 
model, i.e. the household means of all auxiliary variables. The above preliminary results 
suggest that individual and family estimation could be integrated at little or no loss in effi- 
ciency for estimates of persons. 


5. PLANS FOR FURTHER STUDY 


The results presented in this paper are preliminary, and a more extensive empirical evalua- 
tion of the properties of the least-squares estimator is currently under way, with particular 
attention being given to the behaviour of estimates over time and to efficiencies for a larger 
group of characteristics relative to estimates produced with the Labour Force Survey’s cur- 
rent raking ratio estimator. The foregoing results have suggested that at least for some 
characteristics of persons, the ‘‘explanatory power’’ of the age-sex composition of a household 
is at least as great as that of the age-sex group alone. It will be instructive to see if the relative 
efficiencies will be as favourable for characteristics more strongly correlated with age-sex. 
In addition, although in practice negative weights have been uncommon, it is likely that some
procedure must be developed to deal with them when they occur. Among the possibilities
one might consider would be to accord them outlier treatment or perhaps to forestall their
occurrence by imposing some bound on changes to the weights (Zieschang 1987). Finally,
it would be useful to make explicit the undercoverage model underlying the least-squares
estimator to permit an evaluation of the model on its own merits.


Modified Raking Ratio Estimation


H. LOCK OH and FRITZ SCHEUREN¹


ABSTRACT 


A hybrid technique is described that employs both conventional and raking ratio estimation to handle
the case when the population frequencies $N_{ij}$ in a two-dimensional table are known, but some of the
observed frequencies $n_{ij}$ are small (or zero). Results are provided on the approach taken as it has
evolved in the Corporate Statistics of Income Program over the last several years. Changes are still being
considered and these will be discussed as well.


KEY WORDS: Raking ratio estimation; Conventional ratio estimation; Conditional bias and variance.


1. INTRODUCTION 


Raking ratio estimation, or simply ‘‘raking,’’ is a widely used technique in sample surveys.
Applications differ depending on the nature of the sample design, the extent of the auxiliary
information available and the presence of various nonsampling errors (such as might arise
because of nonresponse or undercoverage).

Raking was first proposed by Deming and Stephan (1940) as a way of assuring
consistency between complete count and sample data from the 1940 U.S. Census of Population. The
originators themselves elaborated their ideas early on (Deming 1943; Stephan 1942). Since
then, perhaps because of the basic intuitive appeal of the iterative algorithm employed, there
have been several wholly independent rediscoveries of the technique (Fienberg 1970).

Advances and modifications have also been numerous. For example, important theoretical
work on convergence of the algorithm was done by Ireland and Kullback (1968). As might
be expected, practitioners at Statistics Canada, and also at the U.S. Bureau of the Census,
have deeply studied the application of raking in census and survey taking, especially in
situations where the raking is not allowed to proceed to complete convergence (e.g., Brackstone
and Rao 1979; Fan et al. 1981). A reasonably complete bibliography of the statistical research
on raking prior to 1978 can be found in Oh and Scheuren (1978b).

In many treatments of raking, it is assumed that two (or more) sets of marginal
population totals, say $N_{i.}$ and $N_{.j}$, are known, but that the interior of the table $N_{ij}$ can only be
estimated from the sample. When the $N_{ij}$ are also known, the usual ratio estimator with
weights $N_{ij}/n_{ij}$ would be the natural choice, unless the corresponding sample sizes $n_{ij}$ are
‘‘too small.’’

The present paper describes a hybrid technique that employs both conventional and
raking ratio estimations to handle the case when the population cell frequencies $N_{ij}$ are known,
but some of the observed frequencies $n_{ij}$ are small (or zero). In Section 2, we describe our
approach. Some empirical results from the application of the method to our Corporate
Statistics of Income Program are covered in Section 3. In Section 4, we conclude with a brief
summary and some plans for the future.


¹ H. Lock Oh and Fritz Scheuren, Statistics of Income Division, Internal Revenue Service, 1111 Constitution Avenue
N.W., Washington, D.C. 20224, U.S.A.


2. RAKING RATIO ESTIMATION


2.1 General Considerations 


Raking ratio estimation usually assumes that two (or more) marginal population totals,
say, $N_{i.}$ and $N_{.j}$, are known, but that the interior of the table $N_{ij}$ can only be estimated from
the sample by, say, $\hat{N}_{ij}$, where graphically (Deming 1943) we have


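[Diagram of the two-way table with its marginal totals, and expression (2.1): not recoverable from the source.]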


$$\sum_{j=1}^{S} \hat{N}_{ij} = N_{i.} \qquad (2.2)$$

and

$$\sum_{i=1}^{R} \hat{N}_{ij} = N_{.j} \qquad (2.3)$$

are satisfied in turn. Each step in the algorithm begins with the results of the previous step,
with the $\hat{N}_{ij}$ continuing to change; the process terminates either after a fixed number of steps
or when expressions (2.2) and (2.3) are simultaneously satisfied to the closeness desired. (See
Oh and Scheuren (1983) for further details; see Ireland and Scheuren (1975) for
generalizations to multi-way tables and the handling of computational efficiency issues.)
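The alternating adjustment described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' production procedure: the function name `rake`, the stopping tolerance and the toy table are all assumptions, and the sample counts are used directly as starting values.

```python
import numpy as np

def rake(n_ij, row_totals, col_totals, tol=1e-8, max_steps=1000):
    """Alternately scale rows and columns until both sets of margins match."""
    est = n_ij.astype(float).copy()
    for _ in range(max_steps):
        # Row step: force the row sums to equal the known row totals, as in (2.2).
        est *= (row_totals / est.sum(axis=1))[:, None]
        # Column step: force the column sums to equal the known column totals, as in (2.3).
        est *= (col_totals / est.sum(axis=0))[None, :]
        # After the column step only the row margins can still be off.
        if np.allclose(est.sum(axis=1), row_totals, rtol=0.0, atol=tol):
            break
    return est

# Hypothetical 2 x 3 table of sample counts and known population margins.
n_ij = np.array([[20, 10, 5],
                 [8, 30, 12]])
row_totals = np.array([400.0, 600.0])
col_totals = np.array([300.0, 450.0, 250.0])

N_hat = rake(n_ij, row_totals, col_totals)
print(N_hat.round(2))
```

Setting `max_steps` to a small number mimics raking that is stopped before convergence, in which case one set of margins is matched only approximately.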

By an application of the theory of minimum discrimination information (Kullback 1968),
it can be shown (e.g., Ireland and Kullback 1968) that, under some regularity conditions, if
only the $N_{i.}$ and $N_{.j}$ are known, the $\hat{N}_{ij}$ obtained by raking to convergence are asymptotically
unbiased, normally distributed and minimum variance (i.e., best asymptotically normal, or
BAN, estimators). Theoretical results of this kind are partly what motivates the raking
estimator for a general survey characteristic $Y_{ijk}$ (e.g., income or assets), where we are
interested in estimating the population total

$$Y = \sum_{i=1}^{R} \sum_{j=1}^{S} \sum_{k=1}^{N_{ij}} Y_{ijk} \qquad (2.4)$$

with, say, the statistic

$$\hat{Y} = \sum_{i=1}^{R} \sum_{j=1}^{S} \left( \frac{\hat{N}_{ij}}{n_{ij}} \sum_{k=1}^{n_{ij}} Y_{ijk} \right). \qquad (2.5)$$

The weight

$$W_{ij} = \frac{\hat{N}_{ij}}{n_{ij}} \qquad (2.6)$$

is placed on each individual record on the file for ease of handling. It is important to note
that a feature of the raking algorithm is that if $n_{ij} = 0$ then necessarily $\hat{N}_{ij} = 0$. For
convenience, let $W_{ij} = 0$ in such cases as well.
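A short Python sketch of how such record weights might be attached and used follows. The raked table `N_hat` is simply assumed here rather than computed, and the cell counts and record values are hypothetical.

```python
import numpy as np

# Hypothetical 2 x 2 table: sample counts and assumed raked estimates.
n_ij = np.array([[4, 2],
                 [0, 3]])
N_hat = np.array([[40.0, 30.0],
                  [0.0, 60.0]])   # zero wherever n_ij is zero

# W_ij = N_hat_ij / n_ij, with W_ij = 0 for empty sample cells.
W = np.divide(N_hat, n_ij, out=np.zeros_like(N_hat), where=n_ij > 0)

# One entry per sample record: its cell (i, j) and its value y.
cell_i = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])
cell_j = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])
y = np.array([10.0, 12.0, 9.0, 11.0, 20.0, 22.0, 30.0, 28.0, 32.0])

# The weight is carried on each record, so the estimate is a weighted sum.
record_weight = W[cell_i, cell_j]
Y_hat = float(record_weight @ y)
print(W)
print(Y_hat)   # 2850.0
```

Carrying the weight on the record means any tabulation of the file reduces to a weighted sum, which is the convenience referred to in the text.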

Our interest below will be mainly on the conditional properties of the various estimators
being examined. Such an approach has considerable appeal, as advocated by Holt and Smith
(1979) and Rao (1985). (As an aside, it may be worth noting that Brackstone and Rao (1979),
among others, have looked at the conditional behavior of the raking estimator. They
conditioned, however, on the sample marginals $n_{i.}$ and $n_{.j}$.)


2.2 Conditional Bias 


Following Oh and Scheuren (1983) we focus primarily in this paper on the conditional
properties of $\hat{Y}$, given $n = (n_{11}, n_{12}, \ldots, n_{RS})$. In particular, let $\bar{Y}_{ij}$ be the population mean
for the ij-th subgroup. Then the conditional expected value of $\hat{Y}$ is

$$E(\hat{Y} \mid n) = \sum_{i=1}^{R} \sum_{j=1}^{S} \hat{N}_{ij} \bar{Y}_{ij} = Y + \sum_{i=1}^{R} \sum_{j=1}^{S} \bar{Y}_{ij} (\hat{N}_{ij} - N_{ij}). \qquad (2.7)$$


Thus Y is conditionally biased with the importance of the bias depending on the structure 
of the population and whether or not the raking is to convergence. (Of course, when raking 
to convergence, unconditionally E(N,;) = N, asymptotically.) 
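Employing the usual analysis of variance conventions (e.g., Scheffé 1959)

$$\bar{Y}_{ij} = \bar{Y}_{..} + (\bar{Y}_{i.} - \bar{Y}_{..}) + (\bar{Y}_{.j} - \bar{Y}_{..}) + (\bar{Y}_{ij} - \bar{Y}_{i.} - \bar{Y}_{.j} + \bar{Y}_{..}); \qquad (2.8)$$

hence the conditional bias, given $\mathbf{n}$, is expressible as

$$E(\hat{Y} \mid \mathbf{n}) - Y = \sum_{i} (\bar{Y}_{i.} - \bar{Y}_{..})(\hat{N}_{i.} - N_{i.}) + \sum_{j} (\bar{Y}_{.j} - \bar{Y}_{..})(\hat{N}_{.j} - N_{.j}) + \sum_{i} \sum_{j} (\bar{Y}_{ij} - \bar{Y}_{i.} - \bar{Y}_{.j} + \bar{Y}_{..})(\hat{N}_{ij} - N_{ij}). \qquad (2.9)$$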
If the raking is to convergence, then the first two terms of the conditional bias become zero.
For the third term of the conditional bias to be zero for either form of raking, it is sufficient
that the $\bar{Y}_{ij}$ be such that there is no interaction. In large-scale surveying with many variables,
this is unrealistic to assume; nonetheless, in practice the interaction is often a minor part
of the decomposition of $\bar{Y}_{ij}$; consequently, the raking ratio estimator may, in many cases,
have small biases even in moderate sample sizes.


2.3 Conditional Variance 


Conditional and unconditional approaches to the variance of the raking ratio estimator
have been extensively examined (e.g., Binder 1983; Causey 1972; Bankier 1986; Fan et al.
1981; Brackstone and Rao 1979). In our own early work (described in Section 3.2), we have
employed replication techniques (e.g., Leszcz, Oh and Scheuren 1983). The replication
methods used (which were equivalent to conditioning on the sample marginals) proved
expensive, unwieldy, and somewhat unstable, leading us to a simpler attack on the conditional
variance estimation problem (albeit the level of conditioning was deeper).

To motivate the approach we are currently taking, consider the conditional variance of
$\hat{Y}$, given $\mathbf{n}$. Now it can be shown by a slight extension of Oh and Scheuren (1983) that

$$\mathrm{Var}(\hat{Y} \mid \mathbf{n}) = \sum_{i=1}^{R} \sum_{j=1}^{S} \frac{\hat{N}_{ij}^{2}}{n_{ij}} \left(1 - \frac{n_{ij}}{N_{ij}}\right) V_{ij} \qquad (2.10)$$

where the $V_{ij}$ are the population variances of the ij-th subgroup and if $N_{ij} = 0$ or 1 we define
$V_{ij} = 0$. (We are also employing the convention in expression (2.10) that $0/0 = 0$.)


Expression (2.10) holds whether or not the raking goes to convergence. Despite this, it
has been little studied because it cannot be readily adapted to estimate the conditional variance.
The principal difficulty, of course, lies in our inability to calculate stable estimators of the
$V_{ij}$ when the $n_{ij}$ are small. To overcome this problem we began looking at collapsing
techniques based on the size of the raking weight. First, we let $W_{ij}$ approximate $N_{ij}/n_{ij}$, which
gives us

$$\mathrm{Var}(\hat{Y} \mid \mathbf{n}) \approx \sum_{i=1}^{R} \sum_{j=1}^{S} n_{ij} W_{ij} (W_{ij} - 1) V_{ij}. \qquad (2.11)$$

Now if the $W_{ij}$ are ordered from smallest to largest and if they vary over a narrow range,
then averaging them into (ordered) groups of, say, about 25 observations each will
alter the value of expression (2.11) very little. It will, however, allow us to calculate collapsed
post-stratum variance estimates for the $V_{ij}$. This is the approach we have taken in Section 3.

One final point should be noted. The approximation proposed here is stable and fairly easy
to calculate. Our limited empirical work, however, is inconclusive on the method's utility
and, while we feel the method is worthy of discussion, we are in no sense advocating its general
use at this time.
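
One possible reading of the collapsing idea is sketched below. It assumes record-level arrays of a survey variable and of the raking weight attached to each record, sorts the records by weight, forms groups of roughly 25 records, and applies the form n W (W - 1) V within each group using the average weight and the within-group sample variance. The function name `collapsed_conditional_variance`, the use of the sample variance as the estimate of V, and the handling of groups with fewer than two records are all assumptions made for illustration; the text does not spell out these details.

```python
import numpy as np

def collapsed_conditional_variance(y, w, group_size=25):
    """Approximate conditional variance from weight-ordered groups of records.

    y : record-level values of the survey variable
    w : raking weight carried on each record
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    # Order the records from smallest to largest weight.
    order = np.argsort(w, kind="stable")
    y, w = y[order], w[order]
    n_groups = max(1, len(y) // group_size)
    total = 0.0
    for idx in np.array_split(np.arange(len(y)), n_groups):
        if len(idx) < 2:
            continue  # a variance cannot be estimated from a single record
        w_bar = w[idx].mean()        # averaged weight for the group
        v_hat = y[idx].var(ddof=1)   # collapsed variance estimate
        total += len(idx) * w_bar * (w_bar - 1.0) * v_hat
    return total
```

Records with a weight of one (cells sampled with certainty) contribute nothing under this form, as one would expect.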


2.4 Modified Raking Estimation


As we have noted, under fairly general conditions the $\hat{N}_{ij}$ are BAN estimators. This does
not mean, however, that $\hat{Y}$ will share all these properties. Indeed, if the variables used in
the raking are not highly correlated with the characteristic Y, the estimator $\hat{Y}$ may suffer
some degradation in variance relative, say, to a simple ratio estimator

$$\sum_{i=1}^{R} \sum_{j=1}^{S} \frac{N_{ij}}{n_{ij}} \sum_{k=1}^{n_{ij}} Y_{ijk}. \qquad (2.12)$$


Typically, of course, experience has shown that both positive and negative impacts may
occur in the same sample. The practitioner's problem is somehow to keep the positive effects
while minimizing the negative ones.

There seems to be no general solution to this dilemma but we have had some limited
successes, in our application settings, with two techniques that may be of wider interest (see
Subsections 3.2 and 3.3 for results).
In most treatments of raking, it is assumed that the marginal population totals $N_{i.}$ and
$N_{.j}$ are known, and that the interior of the table $N_{ij}$ can only be estimated from the sample.
In our setting we actually have the population values $N_{ij}$ and are employing raking as a way
of systematically handling cells in the table where the $n_{ij}$ are small. Conventional collapsing
alternatives exist here, of course (e.g., Cochran 1977; Fuller 1966), but seemed unsuitable
for reasons that will be explained later.

It may be possible to agree that raking is a satisfactory way of handling the small cells
in this setting; but what about the larger ones? Surely it would be better to use the
conventional simple ratio estimator in the large cells. Indeed, if this were done, the conditional bias
for these ‘‘large’’ cells would be zero; but what would be the effect on the rest of the cells?
This line of reasoning suggested that we employ a hybrid estimation method where, for cells
where the $n_{ij}$ was large, the conventional simple ratio estimator is used. These cells are then
removed from the population and sample tables, and the remaining sample cells are raked
to the adjusted population marginals.

For the remaining smaller cells, a second procedure was introduced to reduce the possible
negative impacts of the raking on certain variables. We bounded the raking so that the weights
$W_{ij}$ did not vary ‘‘too much’’ from the initial weight. (This kind of constraint is often
employed, by the way, in simple ratio estimation, e.g., Hanson 1978.)

The approach to bounded raking ratio estimation is similar to that when ‘‘large’’ sample
counts are available in a single cell. That is, it is similar in that, for the cell that is to be
constrained, we bound the $W_{ij}$; then take the estimated population total $\hat{N}_{ij} = W_{ij} n_{ij}$ for that
cell and the sample $n_{ij}$ for that cell out of the population and out of the sample tables
(respectively); and then adjust the remaining observations.
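
The hybrid procedure just described can be sketched as follows. This is an illustration under stated assumptions rather than the production algorithm: the function names `modified_rake` and `_scale`, the parameters `large`, `lo`, `hi` and `cycles`, and the rule of checking the bounds once at the end of each cycle are all hypothetical. The default bounds are the square roots of 2/3 and 3/2, and the default of two cycles follows the bounded variants described in Section 3.2. Initial weights are assumed to be positive wherever the sample count is positive.

```python
import numpy as np

def _scale(est, target, axis):
    """Scale est so that its sums along `axis` match `target`."""
    sums = est.sum(axis=axis, keepdims=True)
    tgt = np.expand_dims(np.asarray(target, dtype=float), axis)
    factor = np.divide(tgt, sums, out=np.zeros_like(sums), where=sums > 0)
    return est * factor

def modified_rake(n, N, init_w, large=200,
                  lo=np.sqrt(2.0 / 3.0), hi=np.sqrt(3.0 / 2.0), cycles=2):
    """Hybrid of simple ratio adjustment (large cells) and bounded raking."""
    n = np.asarray(n, dtype=float)
    N = np.asarray(N, dtype=float)
    init_w = np.asarray(init_w, dtype=float)
    w = np.zeros_like(n)
    # Large cells: simple ratio weight N_ij / n_ij, then removed from the tables.
    big = n >= large
    w[big] = N[big] / n[big]
    removed = np.where(big, N, 0.0)      # population taken out of the margins
    active = (n > 0) & ~big
    est = np.where(active, init_w * n, 0.0)
    for _ in range(cycles):
        rows = N.sum(axis=1) - removed.sum(axis=1)
        cols = N.sum(axis=0) - removed.sum(axis=0)
        est = _scale(est, rows, axis=1)
        est = _scale(est, cols, axis=0)
        # Ratio of the current weight to the initial weight in each active cell.
        ratio = np.divide(est, init_w * n, out=np.ones_like(est), where=active)
        out = active & ((ratio < lo) | (ratio > hi))
        # Constrain offending cells at the bound and take them out of the tables.
        w[out] = init_w[out] * np.clip(ratio[out], lo, hi)
        removed[out] = w[out] * n[out]
        active &= ~out
        est[out] = 0.0
    w[active] = est[active] / n[active]
    return w
```

Every cell removed in this way becomes a zero in the table that is actually raked, so a sketch like this makes it easy to count how many zeros the modifications introduce.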

Three problems exist with these partial ‘‘solutions.’’ First there is the (uncomfortable)
arbitrariness of the definitions of a ‘‘large’’ cell, and of a weighting factor that varies ‘‘too
much’’ from its initial value. A related concern was why, if we were willing to use simple
ratio estimation for ‘‘large’’ cells, conventional collapsed stratum techniques could not be
used for the remaining cells. The third problem has to do with the properties of the raking
algorithm's convergence when we employ this hybrid. It is quite clear, for example, from
the research that has been done on raking that tables with too many zeros in them will be
very unstable and the raking may not converge (e.g., Oh and Scheuren 1978a and 1978b;
Ireland and Scheuren 1975). This is of particular concern since the effect of both our
modifications is to introduce zeros into the table. If these zeros are strategically placed, or better,
misplaced, then this could have a very serious detrimental impact on the rate of convergence
and, even, on the quality of the estimators. Our recommendation before starting was,
therefore, that the number of times that these procedures were employed would have to be
fairly small. It is beyond the scope of the present paper to resolve these concerns in general
(if indeed that is possible). In Section 3, however, we will consider them further for the applied
setting in which we did this work, and also will return to them in Section 4, when discussing
areas for future study.


3. RAKING IN THE CORPORATE STATISTICS OF INCOME PROGRAM 


3.1. Background 


The U.S. Internal Revenue Service has produced statistics from corporate tax returns
annually for over 70 years. Corporate data are, in fact, a mainstay of the so-called Statistics
of Income Program, which is the name collectively given to all of the non-administrative
statistical series produced by the Internal Revenue Service for public consumption.

Until 1951, corporate statistics were based on a complete census of the returns filed. Since
then, a stratified probability sample has been employed, currently running in size at about
90,000 returns annually (from about 3,000,000 returns filed). Assets and income are the
principal stratifying variables (Jones and McMahon 1984). Stratification by industry has long
been considered, as well, but the quality of the industry coding as self-reported by taxpayers
seemed insufficient to justify this step on a wholesale basis. Typically, for example, at the
minor industry level perhaps 20 percent or more of the self-reported codes are changed during
statistical processing. Nonetheless, because of the importance of industry statistics, efforts
to use administrative data by industry to post-stratify the sample still seemed warranted
and have been pursued over many years (e.g., Westat, Inc. 1974; Leszcz, Oh, and Scheuren
1983).

In a pilot post-stratification study done by Westat during the early 1970's, substantial
improvements in standard errors were achieved for a number of variables, notably Total
Receipts (where a reduction of about 12 percent occurred). Some increases in standard
errors took place, however, for variables not closely related to industry (e.g., distribution to
shareholders), but these were minor. To handle small cells, Westat used conventional
collapsed stratum techniques to combine industry post-strata within the then-existing sample
strata. Concerns continued to exist about the quality of the administrative industry data,
especially for small cells; in any case, due to other operational priorities, the Westat approach
was never implemented.

A major series of budget cuts occurred during the 1980-1982 period, and these forced
a number of changes in the sample designs and estimation procedures across nearly all the
studies that make up the Statistics of Income Program (e.g., Hinkins and Scheuren 1986;
Scheuren, Schwartz, and Kilss 1984); in particular, the corporate study experienced sample
size cuts during this period which, although later partially rescinded, reopened the issue of
post-stratification by industry.

A raking ratio estimation approach to post-stratification seemed to have appeal over what
Westat had done. One of the reasons for this was that concerns about the quality of the
marginal administrative totals, by industry, were not as great as for the individual cells. The
work of implementing a collapsing scheme could be completely avoided, as well.


3.2 Early Modified Raking Results 


When we implemented a pure raking scheme for the Tax Year 1979 sample, our principal 
customers expressed concerns about what we had done. They were particularly worried about 
the potential for large adjustment factors having an adverse effect on certain statistics. We, 
in turn, having seen the results ourselves, were concerned that we had not done an adequate 
job for those industry-sample stratum combinations where the number of sample
observations was large. As a consequence, these results were never used and the 1979 Tax Year
statistics were published employing normal stratified sampling estimation (NORM). 

Research continued, however, and in 1983, a paper was given comparing the root mean 
square errors of six different variations of raking both with each other and with what we 
had been doing previously (Leszcz, Oh, and Scheuren 1983). Three ‘‘pure’’ raking alternatives 
were looked at: 


PRRE: ‘‘Classical’’ raking ratio estimation to convergence (Deming and Stephan
1940);


PRRE (200): Simple ratio adjustment of cells with samples of 200 returns or more and
‘‘classical’’ raking of the remaining cells to convergence; and


PRRE (400): Simple ratio adjustment of cells with samples of 400 returns or more and
‘‘classical’’ raking of the remaining cells to convergence.


In addition, three versions of bounded raking ratio estimation were examined, all with
the bounds set at $(\sqrt{2/3}, \sqrt{3/2})$. These were:


BRRE: Bounded raking ratio estimation (2 cycles); 


BRRE (200): Simple ratio adjustment of cells with samples of 200 and bounded raking 
(2 cycles) of the remaining cells; and 


BRRE (400): Simple ratio adjustment of cells with samples of 400 and bounded raking 
(2 cycles) of the remaining cells. 


For the bounded raking we were initially not sure that complete convergence was possi- 
ble; hence, we made an operational simplification and only cycled through the constraint 
equations, e.g., (2.2) and (2.3), twice. 

To make the root mean square error (RMSE) comparison, pseudo-replicate half-samples 
were drawn, each designed in the same way as the overall sample. The procedure involved: 
(1) construction of the half-samples; (2) two-way classification - by original sample stratum
and major industry (post-stratum) - of sample counts for each half-sample; (3) derivation 
of a set of weights for each half-sample for each estimator; (4) calculation of estimates of 
selected items by applying the weight to sample values for each half-sample; and (5) calculation 
of the RMSE, based on the variations in the estimates that each half-sample produced. For 
cost reasons only 14 sets of half samples were used. 
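
Step (5) and the percentages reported in Table 1 can be illustrated with a small sketch. The text does not give the exact RMSE formula, so the version below, which measures the half-sample estimates against a single reference value, is an assumption, as are the function names `rmse` and `percent_reduction`.

```python
import math

def rmse(half_sample_estimates, reference):
    """Root mean square deviation of half-sample estimates from a reference value."""
    squared = [(e - reference) ** 2 for e in half_sample_estimates]
    return math.sqrt(sum(squared) / len(squared))

def percent_reduction(rmse_new, rmse_norm):
    """Reduction in RMSE as a percent of the normal stratified sampling RMSE."""
    return 100.0 * (1.0 - rmse_new / rmse_norm)
```

A negative value from `percent_reduction` corresponds to an increase in RMSE relative to normal stratified sampling estimation.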

The resultant summary tabulation, presented as Table 1, reveals what one would have expected
for the number of returns. Near 100 percent reductions occurred for the PRRE, PRRE(200),
and PRRE(400) estimates. Application of the bounding limits $\sqrt{2/3}$ and $\sqrt{3/2}$, and not
cycling to convergence, decreased the magnitude of these reductions; however, they were
still substantial. As Table 1 also indicates, for Total Receipts, a key variable, there were also
improvements, although much less sizable.
Table 1


Reduction in Root Mean Square Error (RMSE)
as a Percent of Corresponding Normal Stratified Sampling RMSE


                                      Number of     Total        Jobs
Estimator                             Returns       Receipts     Credit

‘‘Pure’’ raking ratio estimators:
    PRRE                              98.6           8.3         -3.[illegible]
    PRRE (400)                        98.6           9.2         -3.[illegible]
    PRRE (200)                        98.6          11.9         -3.4
Bounded raking ratio estimators:
    BRRE                              74.0          13.8         +1.0
    BRRE (400)                        73.4          15.6         +1.0
    BRRE (200)                        [illegible]   17.4          1.0


Note: The percentages shown are simple averages of the percent reductions in each of the 56 major industry groups
used in the post-stratification. Notice that the percentage improvements for the ‘‘number of returns’’ column are
nearly but not 100 percent for the PRRE estimators. This occurs because the raking took place for all corporations,
with both the $N_{ij}$ and $n_{ij}$ defined on this basis; however, only active corporations (about 90 percent) were tabulated.
The BRRE estimators in the ‘‘number of returns’’ column differ from each other and from the PRRE estimators
because the cycling was not to convergence. This has subsequently been changed, beginning with Tax Year 1985.


Jobs Credit results in Table 1 are included to illustrate the expected tradeoff that can exist
for items not closely related to industry. In particular, we see that in some cases there are
(modest) increases in the root mean square errors for this item, due presumably to the fact
that this field is less dependent upon the industry groupings utilized in this research.

It should be noted that, for Total Receipts, the decreases shown in the root mean square
error, from the initial (NORM) estimate to that utilizing raking ratio estimation, all
compare favorably with the Westat pilot study results. While we are encouraged by this
comparison, a great deal has changed over the decade between the earlier Westat results and
those in Leszcz, Oh and Scheuren (1983). What would really be telling, and what has not
been done, is to compare conventional collapsing schemes with our modified approach to
raking on the same data set.

One final point about Table 1: it reflects improvements in RMSE when tabulating by the
administrative industry information which was used in the post-stratification. Because of
differences between the administratively and statistically assigned classifications by industry,
the figures shown in this table are likely to overstate the improvements being achieved
in our published statistics, since so many entities (over 20 percent) are recoded during the
in-depth processing done of our corporate sample.


3.3 Current Modified Raking Results 


Beginning with Tax Year 1980, we began to regularly produce and publish our corporate
statistics using the bounded raking ratio estimator BRRE(200) (U.S. Department of Treasury
1984). For Tax Years 1983 and later, we made the modifications described in Section 2
so that approximate conditional variances could be calculated. These were first published
for Tax Year 1984 (U.S. Department of Treasury 1987). Also, in an effort to confirm the
earlier results, we undertook for Tax Year 1984 to compare the conditional variance of the
modified raking method being employed with the variance that would have been estimated
had we used normal stratified sampling estimation. Before discussing the limited comparisons
made, it might be worthwhile giving some of the application details on the corporate setting
for 1984.

In our earlier work (Leszcz, Oh and Scheuren 1983), and for 1984, the entire corporate
return population of IRS Forms 1120 and 1120S was tallied into 58 major industry groups. 
For 1984, industry was cross-classified by 14 sample strata in each of the two processing 
years during which the sample had to be selected. Some of the major industries were so sparse 
that we immediately collapsed the industry detail to 56 groups. This still left a very large 
table (of 1568 cells). 

It may be of interest to note that there were 414 ‘‘natural’’ zero cells in the population 
and an additional 125 zero cells arising in the sample. Before raking we removed 96 cells 
that had 200 or more sample observations; these cells were then each ratio adjusted separately. 
(In all, 57 percent of the Forms 1120 and 1120S corporate sample were so adjusted.) Finally, 
there were 73 cells that had to be bounded during the raking itself. This meant that altogether 
in the raking step there were 708 or 45 percent of the cells being treated as zeroes. 
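
The cell counts quoted above can be checked with a few lines of arithmetic. The snippet only restates figures given in the text; the variable names are illustrative.

```python
# Table dimensions: 56 industry groups x 14 sample strata x 2 processing years.
industries, strata, periods = 56, 14, 2
total_cells = industries * strata * periods

# Cells treated as zeros in the raking step, by source.
zero_cells = {
    "natural zeros in the population": 414,
    "additional zeros in the sample": 125,
    "large cells ratio adjusted separately": 96,
    "cells bounded during raking": 73,
}
treated_as_zero = sum(zero_cells.values())
share = 100.0 * treated_as_zero / total_cells

print(total_cells, treated_as_zero, round(share))
```

The totals agree with the 1568 cells and the 708 cells (45 percent) reported in the text.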

The raking was initiated by introducing the normal stratified estimator into each cell of 
the table. The marginal constraints imposed were (1) by industry and sampling period, and 
(2) by sample strata and sampling period. In the published statistics for 1984, and in the 
comparisons made here, the raking did not go to convergence; it was just carried out for 
two cycles. (Incidentally, concerns about the conditional bias of this approach have led us 
to rake our 1985 sample data to convergence.) 

The results of the efforts for 1984 were to reduce the overall and industry-by-industry 
standard errors for frequencies by substantial amounts - only about half as much, however, 
as is shown in Table 1. Similar dampened improvements occurred for Total Receipts (8.7 
percent) with many variables like Jobs Credit and Net Income experiencing little or no change 
in their standard errors overall (see U.S. Department of Treasury 1987, for details). As already 
noted, conditioning may be part of the reason for this difference (Holt and Smith 1979). 
The original results were conditional on the sample marginals $n_{i.}$ and $n_{.j}$; the later figures
employed a deeper level of conditioning. 

We are still examining other possibilities as to why the improvements are more modest
than we found in the earlier work. Some obvious possibilities are the way we grouped the
data from the smaller cells, including the consequent averaging of the weighting factors $W_{ij}$,
and the collapsed variance estimation of the $V_{ij}$. Tabulating the data using our statistical
industry coding, rather than the administrative coding, as in Table 1, may have been a major
factor.


4. CONCLUSIONS AND AREAS FOR FURTHER STUDY 


4.1 General 


The modified raking approach for our corporate sample certainly seems to be an improve- 
ment over the normal stratified sampling approach taken formerly. There are, however, a 
number of unsettling ad hoc aspects of the method that trouble us. For instance, the connec- 
tion between conventional collapsed stratum techniques and our modified raking procedure 
needs more study. Exploring changes in estimation techniques is not enough, however. More 
work on the basic sample design appears needed too. Finally, the variance approximation 
being used needs further looking at. We may well have paid a high price for stability and 
ease of calculation. As noted earlier, the statistical literature is full of good alternatives, and 
these deserve to be examined in a full-scale comparison with what we are currently doing. 


4.2 Estimation Issues 


There is considerable intuitive appeal in developing a post-stratification method that 
smoothly increases the degree of conditioning from just using marginal totals to using some 
or all of the interior population counts as well. Our current approach has an embarrassing
ad hoc flavor. Frankly, we see it just as a stop gap until we can increase the quality of the 
underlying administrative data by industry. Our main concern is to reduce response varia- 
tion arising from taxpayer or processing errors. Even if we are unsuccessful in improving 
the administrative data directly, it may be possible to dampen the response error effects by 
looking at the tables by industry and sample stratum over several years. This is planned and
may allow us to integrate, in a more complete way, raking on the one hand and collapsed
post-stratum estimation on the other. 


4.3 Design Issues 


Improved administrative data by industry has obvious uses at the design stage. At the 
present time, coefficients of variation differ quite widely by industry, with the smaller in- 
dustries being very poorly represented. No amount of after-the-fact post-stratification can 
correct for this completely. Improving the balance by industry, and over time, appear to 
be top priorities (e.g., Hinkins, Jones and Scheuren 1987). 
Comparison of the Horvitz-Thompson
Strategy with the Hansen-Hurwitz Strategy 


S.G. PRABHU-AJGAONKAR¹


ABSTRACT 


The Hansen-Hurwitz (1943) strategy is known to be inferior to the Horvitz-Thompson (1952) strategy 
associated with a number of IPPS (inclusion probability proportional to size) sampling procedures. 
The present paper presents a simpler proof of these results and therefore has some pedagogic interest. 


KEY WORDS: Sampling strategies; Inclusion probability proportional to size; Positive definite 
quadratic form. 


1. INTRODUCTION 


Let U be a finite population consisting of N identifiable units $\{U_1, U_2, \ldots, U_N\}$. With
the i-th unit of the population, $U_i$, are associated two numbers $X_i$ and $Y_i$, where the $X_i$ are
known and the $Y_i$ are fixed but unknown. Generally, $X_i$ represents a measure of size of $U_i$
which is highly correlated with $Y_i$.

For estimating the population total $T_y = Y_1 + Y_2 + \cdots + Y_N$, the Hansen and
Hurwitz (1943) strategy consists of selecting with replacement n population units with
probability proportional to $X_i$, and using the unbiased estimator

$$t_{HH} = \frac{1}{n} \sum_{r=1}^{n} \frac{y_r}{p_r},$$

where $p_i = X_i/T_x$, $T_x = X_1 + X_2 + \cdots + X_N$, and $y_r$ ($r = 1, 2, \ldots, n$) represents the
outcome at the r-th draw (with $p_r$ the selection probability of the unit drawn). It is easy to
show, noting that $\sum Z_i = 0$,

$$\mathrm{Var}(t_{HH}) = \frac{1}{n} \sum_{i=1}^{N} \frac{Z_i^2}{p_i}, \qquad (1)$$

where $Z_i = Y_i - p_i T_y$, $i = 1, 2, \ldots, N$.
When population units are selected without replacement, Horvitz and Thompson (1952)
proposed the unbiased estimator

$$t_{HT} = \sum_{i \in s} \frac{Y_i}{\pi_i},$$
¹ S.G. Prabhu-Ajgaonkar, Department of Mathematics and Statistics, Marathwada University, Aurangabad 431004,
India. 




where $\pi_i$ ($i = 1, 2, \ldots, N$) denotes the probability of including the i-th population unit $U_i$
in the sample. Further, when $\pi_i$ is proportional to $X_i$, the sampling procedure is termed an
IPPS scheme. For such a sampling procedure,

$$\mathrm{Var}(t_{HT}) = \frac{1}{n} \sum_{i=1}^{N} \frac{Z_i^2}{p_i} + \frac{1}{n^2} \sum_{i \neq j} \frac{\pi_{ij}}{p_i p_j} Z_i Z_j, \qquad (2)$$

where $Z_i$ is given in (1), and $\pi_{ij}$ ($i \neq j = 1, 2, \ldots, N$) represents the joint probability of
including the i-th and j-th population units in the sample. When an IPPS procedure is specified,
$\pi_{ij}$ can be further simplified.
From (1) and (2),

$$\theta = \mathrm{Var}(t_{HH}) - \mathrm{Var}(t_{HT}) = -\frac{1}{n^2} \sum_{i \neq j} \frac{\pi_{ij}}{p_i p_j} Z_i Z_j. \qquad (3)$$


2. COMPARISON OF STRATEGIES 


Midzuno (1952), Sen (1952) and Sankaranarayanan (1969) proposed IPPS sampling
schemes for estimating $T_y$, using the Horvitz-Thompson estimator $t_{HT}$. The Midzuno-Sen
scheme is feasible if

$$p_i = \frac{X_i}{T_x} \geq \frac{n-1}{n(N-1)}, \quad i = 1, 2, \ldots, N.$$

Sankaranarayanan's scheme requires the weaker condition

$$\sum_{j \in s} p_j \geq \frac{n-1}{N-1} \quad \text{for all } s \subset U.$$

For both the schemes, the joint inclusion probabilities are given by

$$\pi_{ij} = \frac{n(n-1)}{N-2} \left[ p_i + p_j - \frac{1}{N-1} \right]. \qquad (4)$$

Hence, from (3),

$$\theta = \frac{n-1}{n(N-2)} \left[ \sum_{i=1}^{N} \frac{Z_i^2}{p_i^2} \left( 2p_i - \frac{1}{N-1} \right) + \frac{1}{N-1} \left( \sum_{i=1}^{N} \frac{Z_i}{p_i} \right)^2 \right]. \qquad (5)$$

The above expression is nonnegative if

$$X_i \geq \frac{T_x}{2(N-1)}, \quad i = 1, 2, \ldots, N,$$

in which case the Horvitz-Thompson strategy is superior to the Hansen-Hurwitz strategy.
The above restriction on $X_i$ was first derived by Rao (1963) when n=2 and the Midzuno-Sen
scheme is employed, but it is interesting to note from (5) that the restriction remains the
same even when n is greater than 2.

Chaudhuri (1975) and Mukhopadhyay (1975) independently derived the above for the 
Midzuno-Sen scheme. 

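Brewer (1963), Rao (1965) and Durbin (1967) proposed different IPPS schemes, for the
case n=2, with the same inclusion probabilities,

$$\pi_{ij} = \frac{2 p_i p_j}{1+K} \left( \frac{1}{1-2p_i} + \frac{1}{1-2p_j} \right), \quad \text{where } K = \sum_{i=1}^{N} \frac{p_i}{1-2p_i}.$$

These schemes are free from the restrictions on the $p_i$'s of the previous schemes. From (3),

$$\theta = \frac{1}{1+K} \sum_{i=1}^{N} \frac{Z_i^2}{1-2p_i} \geq 0,$$

so that the Hansen-Hurwitz strategy is again inferior to the Horvitz-Thompson strategy.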
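
The n=2 comparison can be checked numerically. The sketch below assumes the standard joint inclusion probabilities of the Brewer-Rao-Durbin schemes, computes the Hansen-Hurwitz variance and the Horvitz-Thompson variance directly from their textbook definitions, and compares their difference with the closed form obtained by substituting those joint probabilities into the general difference of variances. The function names and the example values are illustrative only.

```python
import numpy as np

def brewer_joint_probs(p):
    """Joint inclusion probabilities for the n=2 Brewer-Rao-Durbin schemes.

    Requires every p_i < 1/2. The diagonal holds pi_i = 2 p_i.
    """
    p = np.asarray(p, dtype=float)
    a = 1.0 / (1.0 - 2.0 * p)
    k = np.sum(p * a)
    pij = 2.0 * np.outer(p, p) / (1.0 + k) * (a[:, None] + a[None, :])
    np.fill_diagonal(pij, 2.0 * p)
    return pij, k

def compare_strategies(p, y):
    """Return (Var HH, Var HT, closed-form difference) for n = 2."""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=float)
    n = 2
    z = y - p * y.sum()                      # Z_i = Y_i - p_i * T_y
    var_hh = np.sum(z ** 2 / p) / n
    pij, k = brewer_joint_probs(p)
    pi = n * p
    # Horvitz-Thompson variance: sum over i, j of (pi_ij/(pi_i pi_j) - 1) Y_i Y_j.
    var_ht = np.sum((pij / np.outer(pi, pi) - 1.0) * np.outer(y, y))
    closed = np.sum(z ** 2 / (1.0 - 2.0 * p)) / (1.0 + k)
    return var_hh, var_ht, closed
```

For any size measures with every $p_i$ below one half, the difference of the two variances returned here should match the closed form and be nonnegative.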
In This Issue


Four of the nine papers in this issue deal with Census Coverage Error. These papers and others 
that will appear in the December 1988 issue of the Journal are valuable additions to the rapidly 
growing literature on this topic. Kirk Wolter’s initiative was very helpful in arranging for these 
special sections. 

Census counts are known to be inaccurate due to coverage error and this problem has recently 
attracted a great deal of attention among both policy makers and statisticians - academics and 
practitioners alike. Consequently, methods of measuring the quality of census counts including 
the limitations of such methods, adjustment techniques (both design and model based) to improve 
the quality of population figures, the impact of an undercount on various government programs 
and other related studies have assumed increasing importance. In many countries, evaluation 
studies to measure coverage are carried out during or following each census. In Canada, for 
example, the Reverse Record Check is the most important study undertaken to measure census 
undercount. Similarly, in the United States since the 1950 Census, a Post-Enumeration Survey 
(PES) has been one of the important vehicles used to evaluate census coverage. 

In 1986, the U.S. Bureau of the Census carried out a study called Test of Adjustment Related 
Operations (TARO) in Los Angeles to test a new PES design. Three papers in the special sec- 
tion - those of Diffendal, Schenker, and Hogan and Wolter - thoroughly evaluate the methods 
and procedures used in this new PES, and provide an in-depth analysis of research findings, 
as well as the issues and achievements of the TARO. Diffendal presents an overview of the test, 
describing its methodological and operational aspects. His paper also contains a brief historical 
description of coverage measurement studies in the United States and recent events leading to 
the elaborate studies by the U.S. Bureau of the Census. 

Schenker discusses three methods for dealing with missing data: hot deck imputation, logistic 
regression modeling and weight adjustment. The choice of method depends on the type of missing 
data. For example, logistic regression is used to impute values for binary characteristics. Using 
TARO data, the author compares coverage error estimates obtained under different imputa- 
tion models. 

Hogan and Wolter present a detailed discussion of the potential sources of error in the new 
PES estimates and assess the impact of individual error components as well as the overall impact 
of errors on TARO data. Based on their findings the authors conclude that, in practice, the PES 
estimates may be ‘‘more accurate than original census estimates for some areas, with equal or 
nearly equal accuracy for most other areas’’. 

The fourth paper in the special section, Biemer's ‘‘Modeling Matching Error and Its Effect
on Estimates of Census Coverage Error’’, deals with the specific problem of PES-Census
matching. The author considers three increasingly complex models and examines the impact of
matching on the PES estimates. Implications of the findings for the 1990 Census are discussed.

The other five papers in this issue deal with errors in foreign trade statistics, design issues in 
multipurpose surveys, stratification of skewed populations, the Survey of Income and Program 
Participation conducted by the U.S. Bureau of the Census and personal computer software for 
variance estimation in complex surveys. 

In ‘‘Errors in Foreign Trade Statistics’’, Ryten discusses the sources of errors in foreign trade
statistics as well as procedures for reducing these errors. He proposes the reporting of the levels
of uncertainty in detailed figures. The author explains the causes of discrepancies in counterpart
trade statistics and analyses their relative importance. Based on the results of a study of the import
and export data from a World Trade database created at Statistics Canada, the author raises 
serious questions about the comparability of counterpart data at detailed levels of commodity 
classification. A program to improve the quality of foreign trade statistics is proposed and 
arguments are made for providing users with more factual information about data quality. 

In practice multipurpose uses are often made of data obtained from most surveys. However, 
research literature and text books usually avoid the discussion of ‘‘multipurpose sample designs’’. 
This important topic is addressed by Kish in his paper. He first presents a hierarchy of purposes 
and then discusses various conflicting requirements in designing a multipurpose survey. Ten areas 
of conflict, including determination of sample size and its allocation to domains and strata, bias 
to sampling error relationship, choice of stratification variables and continuity of data over time 
are examined. Solutions are proposed for each area and the use of compromise designs rather 
than designs that are optimal for a single purpose is stressed. Some proposals are less rigorous 
and are presented to stimulate further research on this topic. 

An iterative algorithm for the stratification of skewed populations under power allocation 
(an allocation proportional to the stratum total raised to a low-valued positive power) is given 
by Lavallée and Hidiroglou in their paper ‘‘On the Stratification of Skewed Populations’’. An 
empirical study is presented, comparing the suggested allocation with other allocation methods 
using data from the Annual Retail Trade and Wholesale Trade Surveys conducted by Statistics 
Canada. 

The Survey of Income and Program Participation (SIPP) is an ongoing household survey 
conducted by the U.S. Bureau of the Census. In ‘‘Research Issues in the Survey of Income and 
Program Participation’’, Kasprzyk reviews methodological and statistical issues related to the 
SIPP. The paper examines four topics of special interest related to panel surveys of families and 
individuals. These are questionnaire design, data collection, response error and sampling and 
estimation issues for longitudinal concepts. The paper describes the important issues, provides 
references to studies conducted to address those issues and summarizes the main results of the 
studies. 

In the paper ‘‘Personal Computer Variance Software for Complex Surveys’’, Schnell, Ken- 
nedy, Sullivan, Park and Fuller describe a program called PC CARP, developed to analyse data 
from complex surveys. This program has found applications, in particular, in many developing 
countries. The features and capabilities of the system are briefly described. 


The Editor 


Errors in Foreign Trade Statistics 


JACOB RYTEN¹ 


ABSTRACT 


In spite of the comparative ease with which studies of error in foreign trade statistics could be conducted, 
there are few attempts to quantify their size, origin, distribution, and change over time. Policy makers 
and trade negotiators have little notion of how uncertain these statistics are in spite of their great detail. 
This paper takes advantage of a World Trade Database developed by Statistics Canada to examine and 
quantify discrepancies in existing foreign trade statistics. 


KEY WORDS: Foreign trade; Bilateral trade balances; Errors. 


1. INTRODUCTION 


This paper discusses some of the underlying causes of errors in foreign trade statistics; 
difficulties in detecting errors; ways of conveying the uncertainty in the detailed figures; and 
a proposal to improve the quality of the data. 

There has not been much written about error in foreign trade statistics since Allen and Ely 
(1953) co-edited a book on these statistics thirty-five years ago. Some attention has been paid 
to accounting matters — inclusions and exclusions, demarcation of boundaries, valuation, 
etc. (United Nations, 1982) — and most of all to classification. In fact, one of the biggest 
changes in trade classification ever has just been introduced (United Nations 1986) in order 
to make foreign trade data more comparable among countries. But perhaps because these 
statistics rely on a complete accounting of all merchandise transactions that take place across 
borders in any period of time and this accounting is enforced by a policing agency — customs 
administration — there is a widespread belief that there is not much measurable error left. 
The lack of analysis of error in these statistics supports this contention. 

Periodically, it has come to the attention, particularly of statistical offices in international 
agencies, that there is a serious error in the reporting of trade between pairs of countries. At 
its eighteenth session, the United Nations Statistical Commission (1974) was formally informed 
of the reconciliation of trade statistics between the United States and Canada. This followed 
the detection of some embarrassing differences in the bilateral trade balance between the two 
countries. Thereafter, and at various times, issues involving Singapore and Malaysia, Singapore 
and Indonesia, and any of a number of non-EEC countries and the Netherlands were brought 
up for discussion at international agencies that were more specifically interested in trade matters. 
Moreover, countries which felt that they were losing control over the quality of their foreign 
trade statistics — typically third world countries — have attempted to piece back their own 
numbers by reference to those of their principal trading partners. But there is no evidence that 
any of these expressions of concern has ever resulted in a systematic programme to detect, 
measure and reduce error in the underlying statistics. 

There are few obvious alternative explanations for this lack of action other than the belief 
that there is no error. Foreign trade statistics are among the very few where there can be a 
comparison of two measurements of the same transaction derived in virtually the same detail 
using the same procedures, by two independent record takers. The differences that result when 
these comparisons are made have been referred to in the literature going back to almost the 
first world war (Coats 1926). And yet, they have not resulted in proposals to incorporate the 
results of these comparisons in any report on the quality of the underlying statistics. One of 
the deterrents to pursue these comparisons systematically may have been the volume of com- 
puting they entail and the expense involved. Another may be the depth of knowledge that is 
required of counterpart statistical systems which, in addition to being described in some 
instances in a foreign language, usually involve very specific administrative and legal provisions 
which are not comparable from country to country. 

The deterrents to systematic comparisons have changed somewhat in Statistics Canada where 
a world trade data base has been established. Its contents are detailed trade statistics of the 
countries that report data in machine-readable form to the United Nations Statistical Office 
(UNSO). UN member countries undertake, under the terms of membership, to report a number 
of key statistics to the UNSO in the manner specified by the UN Secretary General. These 
statistics include foreign trade statistics broken down by country and commodity, with the latter 
in either the full detail of the Standard International Trade Classification (SITC) or its 
equivalent Customs Cooperation Council Nomenclature (CCCN). Annual reports in machine- 
readable form go back to the early sixties. 

The world trade data base was created to support Canadian negotiators involved in the 
current round of multilateral tariff reductions and also to help Canadian exporters and 
importers get a better understanding of the markets and suppliers with which they deal. Its 
main shortcoming is that it is not complete. The centrally planned economies either fail to report 
or else only provide very aggregate data; many of the third world countries experience serious 
delays in processing their Customs records as a result of which there is still much missing in 
recent years; not all countries report on the same vintage of the SITC; and there is a fair amount 
of variation in the concepts and definitions adopted by different countries. 

But these shortcomings are more than offset by the fact that the computing involved in 
comparing trade statistics is now manageable; that a very large proportion of world trade only 
involves the western countries and is reported currently; and that the latter have moved to pro- 
gressively more comparable conceptual frameworks. Taking these elements into account, a 
world trade data base can be used to display the results of comparing counterpart trade statistics 
and this in turn should help statistical agencies to become more conscious of the strengths and 
weaknesses of their merchandise imports and exports data. This is a necessary condition to 
improve the reliability of trade statistics. Given the attention that is currently paid to these data, 
statistical agencies throughout the world are well advised to make the improvements suggested 
by bilateral comparisons of counterpart data even if they can only do so gradually. 

In the next sections, there is a review of the principal causes of discrepancies in counterpart sta- 
tistics and of what steps can be taken to estimate their relative importance in particular situations. 


2. TRADE TRANSACTION RECORDS: ERRORS AND DIFFERENCES 
IN COUNTERPART RECORDS 


Underlying two counterpart trade records, there is, in most cases, one single documented 
transaction. An exporter has made a sale and invoiced the purchaser accordingly. That invoice 
is likely to contain the essential facts about the transaction which includes a description of the 
product(s) sold, the corresponding value and quantity, the terms and conditions of the sale, 
an identification of the purchaser and of the purchaser’s residence and a date on which the 
transaction took (or will take) place. This record generates a number of related records, some 


Survey Methodology, June 1988 5 


derived by transforming the basic information in some prescribed manner and others through 
record linkage with related records. Examples of the latter include a description of how the 
products transacted were moved from the place of sale to the place of purchase and how much 
that cost, the cost of insuring the shipment, what amounts were charged to the two parties to 
the transaction because of duties, sales taxes, consular charges etc.; and of course, the form 
and date in which the purchase was settled. 

The transformations of the basic information have to do with conventions regarding the 
way in which this basic information is recorded and the documentation of the different stages 
of the transaction over time. These transformations are not standard across countries. The 
conventions that rule them are either embodied in Customs law or else in the administrative 
regulations that govern Customs record keeping. They give rise to the documents that form 
the basis of foreign trade statistics. One set of documents is kept by the country of sale; and 
the other by the country of purchase. In practice these documents differ in spite of relating 
to what is in principle and in fact the same commercial transaction. 

Firstly, they differ in time. Even between adjacent countries or in cases where air transport 
is involved, differences in time are not trivial. They arise because the chain of links that make 
up the transaction is long — bringing the shipment to the point from which the international 
carrier will depart; warehousing while waiting for international transport; arriving at the point 
of destination; warehousing while waiting to clear Customs formalities; and while this is going 
on, filing documents at different stages and having them recorded on the basis of different 
conventions. Also, in one country the time of transaction may be recorded as the time the invoice 
is received in the importing country and in another as the time amounts owing to the Customs 
administration are paid. 

Secondly, in one country the recording of the value of the purchase may include all costs 
of international transportation and insurance; whereas in another these may be kept separately. 
Thirdly, in one country the transaction may be imputed not to the country from which the 
invoice was issued but rather to the country where the product was grown, extracted or manufac- 
tured; whereas in another, it is the residence of the seller that decides the country assignment. 
Political stances can also affect the way a country is identified on the records. Fourthly, customs 
regulations can bias the way imports or exports are recorded. Fifthly, there are data coding 
and processing errors. And finally, the units in which the quantities are reported can cause 
inconsistencies. The following sections provide additional detail on these factors. 


i) Differences between exports and imports records: timing 


Customs administrations will normally file records in a variety of ways: by country of origin; 
by the identification of the importing business or its agent; and by time of receipt. But there are 
at least four key events involved in an import transaction all of which may be recorded but only 
one of which will be chosen as the date for retrieval and statistics. The choice of the date is not 
subject to statistical standardization but rather to how customs views its prime function and to 
the technical capacity to store alternatives. Clearly, if one country chooses as its date to record 
exports the time when the forwarding agency prepares an export document; and the counter- 
part country chooses as time for imports the date when all duties and other dues are settled, the 
possible lag between the recording of exports and the corresponding imports is at its maximum. 


ii) Differences between exports and imports records: values 


Value differences have long stood in the way of systematic comparisons so it is best to review 
them and assess their relative importance. The valuation of the transaction, that is to say the 
price at which it is recorded for purposes of customs administration, is critical. Many countries 
(most?) record the value of an import including the cost of international transport and 
insurance relating to the shipment. Most countries record the value of the counterpart exports 
excluding these components. There are additional variations: some countries include portions 
of inland transport and insurance and some countries exclude harbour costs from costs of inter- 
national transport. But these differences only present a marginal increase in the difficulty of 
comparing counterpart records. Transactions involving related commercial partners as in the 
case of multinational enterprises trading internationally pose a problem of valuation which 
is solved in different ways in different countries. It is possible that this source of difference 
will outstrip all others in the years to come. 


iii) Differences between exports and imports records: country 


There is the matter of country crediting which can introduce some of the more puzzling dif- 
ferences in any systematic programme of comparisons. As an exporter, a country can count 
as an export any sale of goods that has to cross its customs boundaries to reach its point of 
destination, independently of whether it was substantially changed or is being sold in the exact 
same form in which it was purchased from some other country. However, as an importer a 
country may decide to impute a purchase to the country where the last substantial transformation 
(normally ‘‘substantial’’ has a precise definition in law) took place. Accordingly, in the 
case of three hypothetical countries A, B, and C, where A has exported some goods to B and 
B has exported the same goods (perhaps transformed) to C, the statistics may be recorded in 
any of many possible ways with different consequences, as shown in the table below. 

The symbols ‘‘x’’ and ‘‘m’’ denote respectively value of exports to and imports from the 
partner country (second upper case letter) as recorded by the reporting country (first upper 
case letter). Accordingly, 

$A_xB$ = Value of exports from A to B as recorded by A 
$A_mB$ = Value of imports from B to A as recorded by A 


        Recorded as        Recorded as 
        exports            imports            Consequence 

i)      $A_xB + B_xC$      $B_mA + C_mB$      Consistent and complete 

ii)     $A_xB + B_xC$      $B_mA + C_mA$      Overcrediting of A by importers 

iii)    $A_xC + B_xC$      $B_mA + C_mB$      Overcrediting of C by exporters 

iv)     $A_xC$             $C_mA$             Consistent but incomplete 

v)      $A_xC$             $C_mB$             No crediting of A by importers 

vi)     $A_xB$             $C_mA$             No crediting of C by exporters 


The different cases indicate that some reporting countries credit their exports to the first 
and others to the last known destination; that some importing countries credit imports to the 
country of origin and others to country of consignment; and some exporting countries count 
as exports whatever leaves their national territory irrespective of the degree of transformation 
to which the goods may be subject. The differences involved in these approaches are not trivial 
matters in days of free trade agreements, Customs unions, free trade zones and other arrange- 
ments to stimulate transborder trade. For each of these arrangements, a separate statistical 
convention is needed to accommodate the effect of the agreement on customs record keeping. 
Crediting partner countries in inconsistent ways is only one source of discrepancy in bilateral 
or multilateral comparisons. The other is due to inconsistent geographic classification. 
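
The effect of the six crediting cases on bilateral comparisons can be illustrated with a small sketch. It assumes a single hypothetical shipment worth 100 that moves from A to B and on to C, and it encodes each case as we read it from the table above; the names `CASES` and `bilateral_gaps` are illustrative only and are not part of the original paper.

```python
from collections import defaultdict

# Each case lists the export and import records as (reporter, partner) pairs.
# ("A", "B") under "exports" means: A records an export to B.
# ("B", "A") under "imports" means: B records an import from A.
# The encoding follows our reading of the six cases; the value 100 is hypothetical.
CASES = {
    "i":   {"exports": [("A", "B"), ("B", "C")], "imports": [("B", "A"), ("C", "B")]},
    "ii":  {"exports": [("A", "B"), ("B", "C")], "imports": [("B", "A"), ("C", "A")]},
    "iii": {"exports": [("A", "C"), ("B", "C")], "imports": [("B", "A"), ("C", "B")]},
    "iv":  {"exports": [("A", "C")], "imports": [("C", "A")]},
    "v":   {"exports": [("A", "C")], "imports": [("C", "B")]},
    "vi":  {"exports": [("A", "B")], "imports": [("C", "A")]},
}


def bilateral_gaps(case, value=100.0):
    """Return {(exporter, importer): exports recorded minus imports recorded}.

    Only flows with a nonzero gap are returned, so an empty dict means the
    counterpart records agree for every pair of countries.
    """
    gaps = defaultdict(float)
    for reporter, partner in case["exports"]:
        gaps[(reporter, partner)] += value   # exporter's record of the flow
    for reporter, partner in case["imports"]:
        gaps[(partner, reporter)] -= value   # importer's record of the same flow
    return {flow: gap for flow, gap in gaps.items() if gap != 0.0}


for name, case in CASES.items():
    print(name, bilateral_gaps(case) or "no bilateral discrepancy")
```

Under these assumptions, cases i and iv show no bilateral gap, while the other four each leave one flow with exports but no matching imports and another with imports but no matching exports. Case iv is consistent only because B's intermediate role is not recorded at all, which is why the table calls it incomplete.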

In fact, many countries embody their stance in international politics in their standard geo- 
graphical classifications. Accordingly, there are differences that arise from inconsistent geo- 
graphic definitions of partner countries. Most Latin American countries treat Puerto Rico as 
a separate origin or destination from the United States. Virtually each OECD member country 
has a different treatment of partner countries in Africa. Some lump them together by their colo- 
nial origins and others by geographic neighbourhood. Similar inconsistencies arise in the treat- 
ment of the Caribbean and South Pacific islands. The Economic Union of South Africa is treated 
in the statistics in ways which often reflect the reporting country’s view of an embargo on com- 
mercial ties with South Africa itself. Moreover, not all countries track the changes in the political 
status of their trading partners with the same zeal so that not all catch up with newly created 
independent nations as quickly as desirable in order to conduct statistical comparisons. 


iv) Differences between exports and imports records: Customs administration 


There is another important difference that arises because the attention paid to exports by Customs 
administrations is less than what their mandate requires they pay to imports. The reporting of indi- 
vidual exports shipments may be consolidated in the interests of paper burden and brought into 
line with the manifests or other transport documents handled by the carrier. In the case of imports, 
the objective is to get reporting in sufficient detail to allow Customs to apply the right duties and 
other taxes. One consequence is that in the case of exports, low value components of a mixed ship- 
ment are more likely to be classified under the same heading as the major component whereas in 
the case of imports the chances are that they will be classified independently. 

This difference in interest that can be ascribed to the mandate of a Customs administration 
has other substantial effects on the quality of exports and imports documents. On the one hand 
there is evidence that the extent of underreporting of exports which affected United States 
overland exports to Canada is not confined to North America. Almost twenty years ago the 
United Kingdom launched a massive programme that consisted in matching shipping manifests 
to export documents because of a perceived rate of underreporting of some one to two per 
cent of the total. On the other hand, there is a presumption that the description of exported 
products is unbiased (unless it covers up illegal shipments) whereas the descriptions of imported 
goods may be biased because they aim at minimizing the rates of duty for which the imports 
are liable. 

In addition to these sources of difference, which are due to the different legal and 
administrative transformations to which the original record is subject, there are others which 
are more variable and more selective in terms of the records to which they apply. Examples 
are the treatment of low value shipments (they are defined as below different thresholds and 
are excluded, included, or sampled with varying rates) and the treatment of commodities that 
have important service elements such as recorded audio and video tapes, architects’ blueprints; 
computing software recorded on magnetic tape; repairs and maintenance etc. 


v) Differences between exports and imports records: coding and data processing 


Virtually all classes of information that are included in the basic records kept by Customs 
reflect the application of a classification or a code to an actual situation. The way to ensure 
consistency of coding is by ruling on borderline cases and ensuring that the accumulated rulings 
form something akin to case law — a body of decisions to be made accessible to coders and 
by which they should be governed. But the only central dispenser of rulings is the Secretariat 
of the Customs Cooperation Council in Brussels and it can neither be consulted by member 
countries on a day to day basis nor can its decisions go beyond a certain level of generality. 
For this reason, there are systematic differences in interpreting and applying standard codes 
sometimes within the same country, let alone among different countries. 

In addition, there are inconsistencies due to errors at the data processing stage and as a con- 
sequence of the systems put in place to reduce their impact. For example, there are errors in 
interpreting Customs legislation and in coding source information that creep in at the stage 
when importers or exporters inform their authorities of an impending shipment; errors at the 
stage of data capture; and errors of coding within the statistical agency. The standard protec- 
tion against these errors is the institution of review and editing systems that rely to differing 
extents on clerical inspection and review and on computerized detection and imputation. 
Although it is very likely that there are other sources for inconsistency, the issues reviewed above 
are the most frequently cited ever since these matters were first described in the literature (Coats 
1926), and probably are the most important explanations of the differences in counterpart 
figures. 


vi) Differences between exports and imports records: quantities, a special variable 


Unlike values, reported quantities are not affected by the inclusion of transport costs nor 
are they biased in order to minimize tax liabilities (although if values are miscoded to lower 
duty categories they will drag the matching quantities along). Unfortunately, there are other 
problems associated with the recording and use of quantities that greatly reduce the value of 
these statistics for error detection. For example, quantities can apply to either an entire ship- 
ment in which case they are usually expressed as a gross weight or else to a specific commodity 
in which case they are expressed as either net weights or in any other appropriate unit (length, 
surface, volume) including, in the case of complex commodities, numbers of units. 

While quantity measurements in gross or net weights are comparable across countries, their 
use is limited by the heterogeneity of the shipments to which they refer. Quantities expressed 
in other units are limited by the variety of units used and, more importantly, by the fact 
that they cannot be aggregated in the commodity classification and the levels to which they 
apply are much too detailed for inter-country comparison, given our current state of knowl- 
edge. Nonetheless, there is a use for these units in matching trade in raw materials particularly 
if in conjunction with values, they are used to track changes in unit values. In fact, a proposal 
for an international study of errors in trade relying chiefly on the matching of unit values was 
made to the eighteenth session of the United Nations Statistical Commission (1974). However, 
member countries did not feel the possible benefit justified the expected cost. At this stage, 
Statistics Canada’s world trade data base does not include quantity information so that the 
applications of quantity statistics have not yet been studied. 
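
As an illustration of how unit values might be matched for a raw material, the sketch below compares the unit value implied by the exporter's record with that implied by the importer's record for the same flow. The figures and the function names are hypothetical, and the sketch assumes both partners report quantity in the same unit.

```python
def unit_value(value, quantity):
    """Value per unit of quantity (for example, dollars per tonne)."""
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    return value / quantity


def unit_value_gap(export_value, export_qty, import_value, import_qty):
    """Relative difference between the import and export unit values.

    A positive gap is expected when imports are valued c.i.f. and exports
    f.o.b.; a gap that changes sharply from one period to the next would
    call for a closer look at one of the two records.
    """
    uv_exports = unit_value(export_value, export_qty)
    uv_imports = unit_value(import_value, import_qty)
    return (uv_imports - uv_exports) / uv_exports


# Hypothetical flow: 50 tonnes recorded on both sides,
# valued at 1000 by the exporter and at 1100 by the importer.
print(unit_value_gap(1000.0, 50.0, 1100.0, 50.0))
```

Because both records here carry the same quantity, the gap reflects the difference in valuation alone; with differing quantities it would also absorb coverage and timing effects.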


3. A PROGRAMME TO MEASURE ERRORS 


The causes of errors have been known for many years (Coats 1926). A proper attempt at 
quantification was made in the first reconciliation project between the United States and Canada 
in the early 70’s. But to this day that is a very large proportion of what is known about errors 
in the foreign trade statistics and obviously suffers from the fact that it concerns trade between 
two adjacent countries and only those two countries. Given the fact that international data 
bases such as the one that Canada has will likely become more popular and that they will be 
provided with a variety of analytical software, it is timely to speculate on what might be done 
to improve trade statistics, or failing improvement, at least to inform users about the limita- 
tions of foreign trade data. It is not likely that at this stage, with the descriptive information 
that is currently available, users in any country realize by just how much the long term trends 
in trade statistics might be off, or how the monthly movements in their national trade balances 
are affected, and most important, how prone to error is information at detailed commodity level. 

Clearly, the flow $A_xB$ should be the same as $B_mA$ so long as all shipments and their 
recording are instantaneous, the basis of valuation is the same for the two partners for the same 
transaction, the rules of inclusion and exclusion are the same, there are no conceptual dif- 
ferences (geographic, accounting, or due to Customs regime) and there are no errors (of coding 
or coverage). Included in ‘‘errors’’ are consistent interpretations of the classificatory schemes 
by one country which would be disputed by other countries or by the Customs Cooperation 
Council. 

In principle, all sources of differences other than errors should be tractable although 
measuring the relative importance of different sources can be difficult in practice. A review 
of the different sources or factors is useful in order to consider how their effect can be accounted 
for in any comparison. Of these factors, transportation is probably the least difficult to deal 
with and almost certainly the least difficult to do something about. There are a number of coun- 
tries such as the United States where imports are measured both ways: including and excluding 
transport. In principle, the information to estimate the cost of insurance and freight (c.i.f.) 
component across the board is available. Importers are legally bound to inform their Customs 
authorities of all their expenses in connection with a purchase abroad and the two broad cate- 
gories of expenses are those that are dutiable (usually those connected with the product itself, 
including its packaging or wiring or mounting) and all others (usually those connected with 
the transportation, insurance and financing of the import). Accordingly, if it were necessary 
to conduct a study of transportation costs, there are administrative records which could be 
linked to the corresponding trade records. There are many technical problems related to how 
shipping and insurance information should be assigned to individual commodities in the case 
of complex shipments but there are proposals for ways to deal with these matters (Ryten 1983). 

Equally, in principle, a study could be made of timing differences in the context of a par- 
ticular flow of trade between any pair of countries. In the case of the reconciliation of trade 
statistics between the United States and Canada, estimates were based on actual matches of 
documents which made it possible to compare dates and estimate average time lags between 
exports and corresponding imports. But there are less expensive methods to arrive at rough 
estimates that are also less constraining from the point of view of access to confidential records 
and are reasonably effective to calculate broad ranges of timing differences by points of exit 
and entry, by mode of transport, and by commodity. 
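
Where matched documents are available, the average lag can be computed directly from the two recording dates. The sketch below is a minimal illustration with invented dates; the pairing of export and import documents is assumed to have been done already, and `average_lag_days` is a hypothetical helper, not a procedure taken from the reconciliation project.

```python
from datetime import date


def average_lag_days(matched_pairs):
    """Mean number of days between the export date and the import date.

    matched_pairs is a list of (export_date, import_date) tuples, one per
    matched pair of documents.
    """
    lags = [(imported - exported).days for exported, imported in matched_pairs]
    return sum(lags) / len(lags)


# Hypothetical matched documents for one point of entry.
pairs = [
    (date(1987, 3, 1), date(1987, 3, 11)),   # 10 days
    (date(1987, 3, 5), date(1987, 3, 25)),   # 20 days
    (date(1987, 4, 1), date(1987, 4, 1)),    # same day
]
print(average_lag_days(pairs))  # 10.0
```

The same calculation could be grouped by point of exit and entry, mode of transport, or commodity to obtain the broad ranges of timing differences mentioned above.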

Together, the estimates of timing differences and the difference between the cost of insurance 
and freight and free on board valuations (f.o.b.) can be expressed in the following equation: 


$$A_mB(k) = B_xA(k) + A_{\mathrm{c.i.f.}}B(k) + \delta + e$$


where $A_mB(k)$ is the flow of imports for commodity k from country B to country A as recorded 
by country A; $B_xA(k)$ the counterpart flow as reported by country B; $A_{\mathrm{c.i.f.}}B(k)$ 
the estimate of transport and insurance costs for that flow of trade as derived from country 
A’s records; $\delta$ a timing adjustment and $e$ an error term that includes all the biases and random 
errors that affect both imports and exports statistics. It is assumed that all other sources of 
difference (geography, inclusions and exclusions, low value shipments etc.) have been disposed 
of either by adjusting for them or preferably by excluding all transactions that may be affected 
by these factors from the comparison files. Over time, the average error should tend to zero 
and therefore the longer the period over which the comparison is made, the closer to each other 
the level or the average rate of change of the figures being compared. Should a comparison 
suddenly yield perverse results, this would constitute prima facie evidence of a deterioration 
in quality of at least one of the two terms of the comparison. 
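
The equation can be rearranged to isolate the error term once the other components have been estimated. The following sketch does this for a hypothetical commodity flow over four years; all figures are invented for illustration, and the function names are assumptions rather than anything defined in the paper.

```python
def residual(imports_cif, exports_fob, cif_margin, timing_adj=0.0):
    """Error term e: recorded imports minus counterpart exports,
    minus the estimated transport and insurance component,
    minus the timing adjustment."""
    return imports_cif - exports_fob - cif_margin - timing_adj


def mean_residual(records):
    """Average error over a list of
    (imports_cif, exports_fob, cif_margin, timing_adj) tuples."""
    errors = [residual(*record) for record in records]
    return sum(errors) / len(errors)


# Hypothetical annual figures for one commodity k traded from B to A.
years = [
    (108.0, 100.0, 6.0, 1.0),    # e = +1
    (103.0, 100.0, 6.0, -2.0),   # e = -1
    (119.0, 110.0, 7.0, 0.0),    # e = +2
    (125.0, 120.0, 7.0, 0.0),    # e = -2
]
print([residual(*y) for y in years])  # [1.0, -1.0, 2.0, -2.0]
print(mean_residual(years))           # 0.0
```

In this invented series the yearly errors cancel, which is the behaviour the text expects over a long enough period; a run of residuals of the same sign, or a sudden jump, would be the kind of result that points to a deterioration in one of the two sources.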


3.1 Analysis Using the World Trade Mini-Database 


For purposes of analysis, a mini-database derived from the world trade database was created 
so as to start studying some of these effects. It covers the three principal trading blocs of the 
Western world: the EEC defined for these purposes as excluding Portugal and Spain; North 
America (Canada and U.S.A.); and Japan. Besides being simpler to use because of the reduced 
number of records, it avoids the problem of late reporting (mainly by third world countries) 
and of non-reporting (mainly by centrally planned economies). The mini-database includes 
exports and imports data for each of the constituent countries broken down by SITC (down 
to the four digit level of detail) and by partner country, from 1978 to 1985. In addition to the 
constituent countries, it includes two aggregates — the EEC and North America. Unlike the 
world trade database which includes a number of imputations to make analysis simpler, the 
mini-database only includes data as member countries reported them to UNSO after UNSO 
merged categories of trade deemed secret by the reporting country and converted non-standard 
codes reported by countries to standard SITC codes. None of these transformations is likely 
to affect the findings derived from the database in a significant way. 

There are a few statistical problems with the grouping of countries in the mini-database. 
The United States has been reporting its imports to UNSO on the basis of c.i.f. but Canada 
reports imports f.o.b. Whereas the United States credits its partner countries on the basis of 
the origin of the imported goods, Canada reports on the basis of consignment (except for 
imports originating in Latin America). This in itself would not be too serious but for the fact 
that the United States is at times credited for exports routed to Canada. Accordingly, while 
the addition of the two countries should improve the matching of counterpart flows, the dif- 
ferent systems of recording make it so much more difficult. Hopefully, this drawback will be 
overcome when United States f.o.b. imports are added to the base and when Canadian imports 
by origin replace imports by consignment for as many back years as possible. 

In the case of the EEC countries, the key role that the Netherlands plays as port of entry 
to its European hinterland makes comparisons difficult. The Customs area of the port of 
Rotterdam acts not only as a giant distribution centre but also as a warehousing facility for 
the countries it serves. Accordingly, the exporter outside the EEC may not know to which 
specific country the sale is made but only that it will be warehoused in Rotterdam and for this 
reason credits the Netherlands with the sale. But the ultimate importer is bound by the rule 
of origin to assign the purchase to the correct country. As for the Netherlands, according to 
its records, no transaction involving goods has taken place across its Customs boundaries. It 
has simply sold harbour and warehousing services to either one of the transactors. 

If the Netherlands served only the other members of the EEC as a port, the creation of an 
EEC total should suffice to improve the comparisons. But other countries (Switzerland and 
Austria in particular) also benefit from Dutch harbours and container terminals. This com- 
plicates matters somewhat because for example, the Swiss importer might apply the rule of 
origin to the Netherlands in cases where there has been a consolidation of imports from many 
origins. Or some value added operation performed outside the Customs zone in Rotterdam 
may not be reported as Dutch foreign trade in merchandise. 

Another obstacle to interpretation is provided by the two Germanies, given that one fails 
to report its imports to UNSO and the other does not regard as exports the transactions it 
conducts with its Eastern counterpart. This means that there are extra-exports by the EEC that 
have no counterpart import records and, more specifically, that there are unreported trade 
transactions between the two Germanies. The size of this unrecorded leak varies with the relative 
affluence of East Germany and can only be surmised by looking at other indicators. There are 
also leaks that affect trade with Japan that will affect the results of comparisons involving Japan 
and its partner countries. These may be created by operations involving branches of Japanese 
firms located in S.E. Asia. However, the effect of these cases on aggregate data is not likely 
to be substantial and should not detract from the value of the analysis using this database. 


i) Comparison of growth rates of counterpart statistics 


Among the analyses conducted on the basis of the mini-database, one involved comparing 
growth rates in counterpart statistics, taking the period 1978-85. The assumption was that over 
that time period, the effect of errors and timing differences would be sufficiently attenuated 
so that the more permanent effects could be recognized. Moreover, by looking at growth rates, 
the effect of different valuations would be avoided to a considerable extent. The likelihood 
is small that the change in the cost of insurance and transportation is sufficiently different from 
the change in the average prices of the commodities transported to affect growth rates substan- 
tially over a period of three or four years. At least in the case of manufactured goods the pro- 
portion of transportation and insurance in the total cost is well below 10 per cent as borne out 
by United States ratios of f.o.b. to c.i.f. Moreover, transport costs would be only related to 
the weight and volume of the goods transported. Insurance costs, which are related to value, 
do not represent a significant proportion of total cost. And inter-transport mode substitution 
is unlikely to add to total cost in any other than exceptional circumstances. Accordingly, if 
the change in the corresponding cost were sufficient to affect import growth rates relative to 
counterpart export rates, the effects should be all in one direction and their size should vary 
with the average bulk of the commodities transported. 

These speculations are only partly borne out by fact. Table 1 shows the differences in annual 
growth rates for counterpart total trade for the pairs of origins and destinations derived from 
trade among the EEC, North America, and Japan. While relatively small, these differences 
do not suggest any pattern though there may be some underlying regularities that escape super- 
ficial inspection. 
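
The comparison behind Table 1 can be sketched as follows. This is only an illustration: it assumes that the growth rates are compound annual rates (the text does not say how they were averaged), and the function names and trade values are hypothetical rather than taken from the database.

```python
def annual_growth(start_value, end_value, years):
    """Compound annual growth rate in percent (an assumed averaging method)."""
    return ((end_value / start_value) ** (1.0 / years) - 1.0) * 100.0


def growth_difference(exports_a, imports_b, start_year, end_year):
    """Percent growth in A's reported exports to B less percent growth
    in B's reported imports from A, over the same period."""
    years = end_year - start_year
    g_x = annual_growth(exports_a[start_year], exports_a[end_year], years)
    g_m = annual_growth(imports_b[start_year], imports_b[end_year], years)
    return g_x - g_m


# Hypothetical counterpart series in millions of dollars (illustration only).
# Imports are valued c.i.f., here a constant 10 percent above f.o.b. exports.
exports_a = {1978: 100.0, 1982: 146.41}
imports_b = {1978: 110.0, 1982: 161.051}

diff = growth_difference(exports_a, imports_b, 1978, 1982)
print(round(diff, 6))
```

With a constant c.i.f. margin the two growth rates coincide and the difference is zero, which is the sense in which growth rates sidestep the difference in valuation.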


Table 1 


Differences in Growth Rates for Counterpart Annual Total Trade 
for Japan, North America and the EEC, 1978-1985 


                              Difference in growth rate for the period 1     Difference in value of
Country A - Country B         1982/78       1985/82       1985/78            exports in 1982 in
                                                                             millions of dollars 2

N.A. - EEC                       .6           -.5           -.5                    265
N.A. - Japan                    -.4            .5            -                      -
EEC - N.A.                      -.8           -.7           -.8                    365
EEC - Japan                     1.4           1.9           1.5                     90
Japan - N.A.                    -.7           -.2           -.5                    200
Japan - EEC                    -1.2           -.6           -.9                    155
Mean Absolute Difference         .8       [illegible]   [illegible]


1 Defined as percent growth in $A_x B$ less percent growth in $B_m A$. 


2 Difference between $A_x B$ and $B_m A$ rounded to the nearest five million dollars. 
A dash (-) denotes an insignificant value. 


Table 2 


Differences in Growth Rates for Counterpart Annual Total Trade by SITC Section 
Japan in 1978-82 and 1982-85 


                                  Japan - North America                          Japan - EEC
                        Difference in growth     Difference      Difference in growth     Difference
SITC Section            rate for the period 1    in value        rate for the period      in value
                                                 of exports                               of exports
                        1982/78     1985/82      in 1982 2       1982/78     1985/82      in 1982

5. Chemicals               .5        -1.6           15            -1.5    [illegible]         5
6. Semi-
   manufactures          -2.5          .9           60         [illegible]    -.5             5
7. Transportation
   equipment             -1.0        -1.0       [illegible]       -2.0        -.7            85
8. Miscellaneous
   manufactures           1.4         -.9           35             -.8         .8            20


1 Defined as percent growth in $A_x B$ less percent growth in $B_m A$ (A is Japan). 
2 Difference between $A_x B$ and $B_m A$ in millions of dollars rounded to the nearest five million dollars. 


Table 2 shows growth rates for selected SITC Sections between Japan and its two trading 
partners. The principle involved in simplifying Table 2 was to ignore flows with less than one 
million dollars in 1982 since such flows do not appear to be sufficiently stable to warrant inter- 
pretation. 

Discussions about internationally comparable commodity classifications have invariably 
demanded more rather than less detail. The collection of statistics for purposes of international 
comparison has induced countries to publish data well beyond the 3-digit level of the SITC or its 
equivalent. A number of third world countries publish data broken down by ten digits corre- 
sponding to nationally-annotated international classification, and country. Inspection suggests 
that flows coded at one digit — where there has seldom been any controversy — are subject 
to very considerable differences when compared with their counterparts as soon as their absolute 
value drops to, say, below 50 million dollars. Beyond the first digit of the classification, 
differences rise very rapidly. 

The case of Japanese exports to North America and counterpart imports shown in both tables 
1 and 2 warrants further consideration. At mid-point (1982) this trade was valued at about 
forty billion dollars (US). Total imports grew on average by half of one per cent per annum 
more than exports. This is an amount of about two hundred million dollars per annum at mid 
point. Detailed examination suggests that a substantial part of the explanation lies with section 
7 of the SITC which includes inter alia all types of transport equipment. There the difference 
in growth rates is of one per cent per annum on average. It would be interesting to pursue this 
investigation to determine whether the discrepancy is evenly distributed or whether its incidence 
is chiefly felt by one particular commodity. 

But whatever the causes, these comparisons suggest that over a sufficiently long number 
of years and for comparatively large portions of total trade flows, differences in growth rates 
are not large in absolute terms. Notwithstanding this observation, even small differences could 
play havoc with period-to-period changes in the overall trade balance, particularly when it is 
close to zero. Moreover, when dealing with a trading partner such as Japan, with exports heavily 
concentrated in one or two one-digit breakdowns of the commodity classification, the 
possibilities of compensation for systematic misclassification are comparatively few. This makes 
it all the more important to understand why bilateral trade as measured by the two counterpart 
reports has not been moving in step. 


Table 3 


Changes in X/M Ratios Between 1978 and 1985 and Comparisons with Standardized 
X/M Ratios Assuming Constancy of SITC Section Shares 1 


                       North America                    EEC                         Japan
                   Simple Ratio   Std Ratio     Simple Ratio    Std Ratio     Simple Ratio   Std Ratio
                   1978    1985     1985        1978    1985      1985        1978    1985     1985

North America                                    .90  [illegible]  .92         .85     .86      .69
EEC                 .96     .92      .69                                       .78     .86      .89
Japan               .95     .98      .91        1.00     .94       .86


1 The simple ratio is $X/M = A_x B / B_m A$. The standardized ratio using common shares is 

$$\text{Std ratio} = \frac{1}{M_{78}} \sum_{i} \frac{x_{it}}{m_{it}} \, m_{i78},$$

where $m_{it}$ = current imports for section $i$ of the SITC ($i = 0, 1, \ldots, 7$), 
$m_{i78}$ = imports in 1978 for section $i$ of the SITC, 
$x_{it}$ = current exports for section $i$ of the SITC, and 
$M_{78}$ = total imports in 1978. 
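
The footnote to Table 3 can be illustrated with a short sketch. It assumes that each section's current x/m ratio is weighted by that section's 1978 imports, which is how the text describes the standardized ratio; the function names and the two-section data are hypothetical.

```python
def simple_ratio(exports_t, imports_t):
    """Current-year weighted ratio: total exports over total counterpart imports."""
    return sum(exports_t.values()) / sum(imports_t.values())


def standardized_ratio(exports_t, imports_t, imports_base):
    """Base-year weighted ratio: each section's current x/m ratio is weighted
    by that section's imports in the base year (1978 in the text).
    Sections with zero current imports would need separate handling."""
    total_base = sum(imports_base.values())
    weighted = sum(
        (exports_t[i] / imports_t[i]) * imports_base[i] for i in imports_base
    )
    return weighted / total_base


# Hypothetical flows for two SITC sections (illustration only).
imports_1978 = {6: 50.0, 7: 50.0}
imports_1985 = {6: 40.0, 7: 160.0}
exports_1985 = {6: 32.0, 7: 152.0}  # section ratios of .80 and .95

simple = simple_ratio(exports_1985, imports_1985)
std = standardized_ratio(exports_1985, imports_1985, imports_1978)
print(round(simple, 3), round(std, 3))
```

In this made-up case the shift toward section 7, which has the higher x/m ratio, lifts the simple ratio above the standardized one even though neither section's own ratio has changed.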


ii) Comparison of the ratios of annual exports to imports 


A different kind of analysis was also very revealing. Any import flow should be equal to 
the counterpart export plus the cost of freight and insurance plus some term which reflects 
the sum of conceptual differences, timing, and errors. Whereas timing and errors should make 
their impact felt mostly in the short term, conceptual differences should emerge as the 
dominant influence in the longer term. For this reason, if the ratio of annual exports to annual 
imports changes over time this can be due to a combination of the following factors: a change 
in the shares of relatively high c.i.f. to low c.i.f. components; a change in the mix of 
commodities with small relative to commodities with large timing differences; a change in the 
proportion of c.i.f. to total value; and other factors. 

Table 3 shows some aggregate results of this analysis. Against each of the flows involving 
Japan, the EEC and North America, there are three figures: the simple (current year weighted) 
ratio of aggregate exports to aggregate imports in 1978, the corresponding ratio in 1985 and 
the standardized base year weighted ratio assuming that the proportions of imports by sec- 
tion to total imports for each flow of trade remained constant since 1978. These standardized 
ratios are an approximation to an estimate that removes the impact of variations in the mix 
of c.i.f. from the variation in the ratio over time. Any difference between the 1978 and the 
standardized 1985 ratios should therefore be ascribed to other factors. 

There are expectations about the way ratios should change over time as a result of the 
increased share of highly manufactured goods in certain export flows. For example, exports 
by the EEC to North America and Japan, and exports by Japan to the EEC and to North America, 
can be expected to include proportionately more manufactures. Accordingly, the ratio that 
reflects changes in mix is higher than the standardized ratio. This follows because the relative 
importance of c.i.f. decreases as the value of a unit of weight or volume increases. 

But there are a priori exceptions to this prediction shown up by the table. For example, the 
exports of North America to Japan show a very large gap between the simple and the standardized 
ratios even though the share of manufactures went up relatively less. 


Table 4 


Variations in Simple x/m Ratios Between 1978 and 1985 Compared with Standardized 
x/m Ratios with Constant SITC Division Shares 


Exports from ...            N.A.    EEC     EEC     Japan   N.A.    Japan
To ...                      EEC     N.A.    Japan   EEC     Japan   N.A.


SITC Sections 


0 Food                      107     102     100      98      98      90
1 Beverages & tobacco        99     100      99     100      99     100
2 Crude materials           100     100      96      93     100     102
3 Mineral fuels             102     117     304      93     103     109
4 Animal & veg. oils         99      98     107     100     101      92
5 Chemicals                 102     101     101      97     100      96
6 Manufactured goods         99     101      99      96      98     100
7 Machinery & transport      96      91      95      97      92     100
8 Misc. manufactures        100     100     100      98  [illegible]  99
9 Misc. transactions        150     176     163     157      86      92


Table 4 provides a breakdown by SITC sections for the ratios corresponding to trade flows 
between each of six pairs of trading blocks recorded in the mini-data base. The figures shown 
are ratios of the simple index at the Section (1-digit) level to the index derived using share of 
imports at the Division (2-digit) level. They indicate the contribution to the variation in ratios 
accounted for by changes in the commodity mix. They are no more than indicators partly 
because they only go down by one level in the commodity classification. 

(Figures in the table are derived by taking the index that measures the change in each section 
of the simple X/M ratio from 1978 to 1985, i.e., (x/m) 1985 divided by (x/m) 1978 and dividing 
it by a corresponding index in which the standardized (x/m) ratio for 1985 was used and where 
the division ratios were aggregated using their 1978 shares in their corresponding division. 
Simple algebra suggests that the ratio obtained $R_i$ is: 

$$R_i = 100 \, \frac{x_{i85}}{m_{i85}} \cdot \frac{m_{i78}}{\sum_{j} \left( x_{ij85} / m_{ij85} \right) m_{ij78}}$$

Notation is similar to that used in Table 3. Subscript $i$ denotes the section and subscript $j$ denotes 
the division within the section ($j = 0, 1, \ldots, n_i$). A figure of 104 for example implies that a 
four percent increase in the current value of exports relative to counterpart imports took place 
for reasons other than the effect of changes in commodity mix on the c.i.f. component.) 
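
The construction described in the parenthetical note can be sketched in code. This is an interpretation of the prose, not the authors' program: it assumes that the 1978 simple ratio cancels when the two indexes are divided, so that the figure reduces to the 1985 simple ratio of a section over its 1985 ratio standardized on 1978 division imports. The function name and the data are hypothetical.

```python
def mix_index(x85, m85, m78):
    """Index for one SITC section, with dicts keyed by division j.

    Simple 1985 x/m ratio of the section divided by the 1985 ratio
    standardized on 1978 division imports, times 100 (assumed reading
    of the description in the text)."""
    simple_85 = sum(x85.values()) / sum(m85.values())
    std_85 = sum((x85[j] / m85[j]) * m78[j] for j in m78) / sum(m78.values())
    return 100.0 * simple_85 / std_85


# Hypothetical divisions "a" and "b" within one section (illustration only).
x85 = {"a": 32.0, "b": 152.0}
m85 = {"a": 40.0, "b": 160.0}

r_shifted = mix_index(x85, m85, {"a": 50.0, "b": 50.0})  # division shares changed since 1978
r_stable = mix_index(x85, m85, {"a": 20.0, "b": 80.0})   # division shares unchanged since 1978
print(round(r_shifted, 1), round(r_stable, 1))
```

When the division shares of imports are the same in both years the index is exactly 100, so departures from 100 arise only where the mix of divisions within the section has moved.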

No pattern is readily detectable: there are roughly as many cases which overshoot as cases 
which undershoot the mark. For the bigger flows, such as North America to EEC or EEC to 
Japan, the commodity mix is relatively stable as a result of which there is little difference between 
base and current weighted ratios (except for those sections of the SITC where trade is com- 
paratively small as in the case of Mineral fuels exported by the EEC to Japan). Moreover, these 
do not move that much over the period. Other flows are very sensitive to the commodity mix 
which suggests that at lower levels of the classification f.o.b./c.i.f. differences explain a small 
portion of the variation in x/m ratios over time. 


3.2 Analyses Using the Complete World Trade Data Base 


Potential country and commodity mis-classifications: 


Tables 5 and 6 derived from the complete world trade data base present counts of potential 
country and commodity misclassification. Table 5 presents a count of the number of cases in 
1983 in which there is bilateral trade in a commodity according to one of the reporting coun- 
tries of a trading pair but not according to the other. This is shown for each level of SITC detail 
as a proportion of all cases. Table 5A shows the impact on value, again for each level of the 
SITC. In addition to providing a summary measure of the size of errors, the tables also give 
an idea of how fast the number of anomalous situations increases as a function of the detail 
of the classification. 
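
The count described above can be sketched as follows. The sketch assumes that "all cases" means every record in which at least one member of the trading pair reports non-zero trade; the record layout and the figures are hypothetical.

```python
def one_sided_shares(records):
    """records: list of (exports reported by origin, imports reported by destination).

    Returns the percentage of records with no reported exports, the percentage
    with no reported imports, and their total."""
    active = [(x, m) for x, m in records if x > 0 or m > 0]
    no_exports = sum(1 for x, m in active if x == 0 and m > 0)
    no_imports = sum(1 for x, m in active if x > 0 and m == 0)

    def pct(n):
        return 100.0 * n / len(active)

    return pct(no_exports), pct(no_imports), pct(no_exports + no_imports)


# Hypothetical records for one level of SITC detail (illustration only).
records = [(10.0, 12.0), (0.0, 3.0), (5.0, 0.0), (8.0, 8.0)]
shares = one_sided_shares(records)
print(shares)
```

Repeating the count at each level of the classification gives one row of the table per level.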


Table 5 
Comparison of Foreign Trade Statistics in 1983 - Number of Records 1 


                          Percentage      Percentage
SITC Level of Detail      Reporting       Reporting       Total
                          No Exports      No Imports

0 (overall)                   11               4            15
1 digit                       14               7            21
2 digit                       16              10            26
3 digit                       19              13            32


1 Percent of number of records of trading pairs with one member reporting no exports/imports 
while other member reports non-zero trade. 


Table 5A 
Comparison of Foreign Trade Statistics in 1983 - Value of Records 1 


                          Percentage      Percentage
SITC Level of Detail      Reporting       Reporting       Total
                          No Exports      No Imports

0 (overall)                   .3               -            .3
1 digit                       .3              .1            .4
2 digit                       .6              .4           1.0
3 digit                      1.1              .9           2.0


1 Percent of value of records of trading pairs with one member reporting no exports/imports 
while other member reports non-zero trade. 
A dash (-) denotes an insignificant value. 


Table 6 
Comparison of Counterpart Foreign Trade Statistics in Two Selected Years 


                                                     1979          1983

Number of records with x > m as percent 
of all records                                        35            25

Value of exports where x > m as percent 
of total exports                                      41            42

x/m ratio for x > m                                  1.18       [illegible]

x/m ratio for x < m                                   .87           .85


Table 7 


X/M Ratios in 1985 
From Three Selected Reporting Countries to Nine Trading Partners 


                                             From
To                          Canada          U.S.A.          Japan

EEC                           .84             .92             .94
Netherlands                  1.93            1.34         [illegible]
Belgium - Luxembourg         1.47         [illegible]        1.26
Denmark                      1.20*            .74         [illegible]
France                        .70             .74*            .69*
Germany, F.R.                 .69             .81             .98
Ireland                   [illegible]         .78         [illegible]
Italy                     [illegible]         .84         [illegible]
U.K.                          .74             .86             .89
Greece                       1.00         [illegible]         .89*


Table 6 shows changes between two selected years in a number of indicators related to 
cases where exports are in excess of counterpart imports. While over a period of four years 
there has been some change in the percentage of records for which exports exceed imports as 
well as in the percentage value of total exports for those records, the changes in question are 
minor. Surprisingly, the cases of x > m account for more than 40 per cent of the total value of 
trade and, as this figure went up fractionally, the proportion of records that accounted for it 
fell by 10 per cent. 
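
The four indicators of Table 6 can be sketched as below. The sketch assumes that the x/m ratio for each group of records is the ratio of aggregate exports to aggregate imports within the group; the text does not say whether an aggregate or an average of record-level ratios was used. The function name and the records are hypothetical.

```python
def excess_export_indicators(records):
    """records: list of (x, m) pairs, x = reported exports, m = counterpart imports."""
    over = [(x, m) for x, m in records if x > m]
    under = [(x, m) for x, m in records if x < m]
    total_exports = sum(x for x, _ in records)
    return {
        "pct_records_x_gt_m": 100.0 * len(over) / len(records),
        "pct_value_x_gt_m": 100.0 * sum(x for x, _ in over) / total_exports,
        # Aggregate ratios within each group (an assumed definition).
        "ratio_x_gt_m": sum(x for x, _ in over) / sum(m for _, m in over),
        "ratio_x_lt_m": sum(x for x, _ in under) / sum(m for _, m in under),
    }


# Hypothetical counterpart records in millions of dollars (illustration only).
records = [(120.0, 100.0), (90.0, 100.0), (45.0, 50.0), (60.0, 50.0)]
indicators = excess_export_indicators(records)
print(indicators)
```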

In the case of Table 7 a number of a priori predictions are tested against fact. Three reporting 
exporters (Canada, United States and Japan) and nine reporting trading partners (the 
members of the EEC other than Spain and Portugal) are studied. The table lists the 1985 simple 
x/m ratios for country to country trade. Other things being equal, the following predictions 
seem plausible: 


- the higher the manufacturing content of a trade flow, the higher the x/m ratio, which is 
equivalent to saying that the c.i.f./total value ratio is smaller, the more value added is 
embodied in a commodity. For this reason, the ranking in ascending order of ratios should 
be Canada, United States, Japan; 


- in the case of trade with the entrepôt countries (Netherlands and, to a lesser extent, 
Belgium-Luxembourg), country miscoding by the exporter should apply mostly to bulk shipments. 
For this reason the x/m ratio in descending order should be Canada, United States, Japan; and 


- x/m ratios greater than one should only occur for entrepôt countries. 


For thirty x/m ratios (counting in the three ratios for the EEC as a whole) there are nine 
cases (entries with * in table) for which the predictions do not hold. Removing Greece’s two 
because the corresponding trade flows are much too small, seven ratios do not behave according 
to expectations which is still in excess of twenty percent of all cases. 

The critical finding in these analyses is that any increase in the level of detail in the classifi- 
cation hierarchy beyond the combined one makes comparisons with counterpart trade very 
difficult. This is not compatible with the progressive attempts, conducted both nationally and 
internationally, to expand the detail of the commodity classification and to increase the number 
of breakdowns by additional classification variables. Even when pooled over time, the trans- 
actions in these detailed cells match poorly with their counterparts. Since it cannot be argued 
that both reports involved in a bilateral comparison are simultaneously correct, the chances 
are that both contain a significant error component. 


4. MAKING USERS AWARE OF ERROR 


There are two separate issues. One is to make users aware that, contrary to widespread belief, 
the foreign trade figures, particularly the detailed figures, may be flawed. The other is to put 
together a programme to improve the quality of foreign trade data taking advantage of the 
fact that counterpart measurements of the same transaction exist. A number of proposals to 
get such a programme underway follow. 

The analysis presented in this paper indicates that beyond the two-digit level of the 
commodity classification by country, even annually, neither levels nor year-to-year changes can 
be taken with complete confidence. Users will probably not take kindly to such a finding, as 
they already have reason to question the coverage of aggregates in the case of exports. The 
results of the reconciliation programme between the United States and Canada should not be 
viewed as limited to the two countries. Others experience the same class of problems to a varying 
extent. The revelation that, in addition to these weaknesses, data by commodity beyond a cer- 
tain level can only be used with great caution, could lead to a fundamental change in the percep- 
tion that users have of foreign trade statistics. 

But, if this measure is not taken, no matter how unpopular the news, a belief that has less 
than full underlying factual support is perpetuated. The detailed commodity figures are used 
in a variety of ways and the one that is most topical is for purposes of tariff policy. Discus- 
sions on these matters rely heavily on detailed figures, seldom on the differences between 
national and counterpart data, and equally seldom on domestic consumption statistics as a 
check on the orders of magnitude suggested by Customs data. Moreover, in another use of 
detailed commodity data, views about industrial and regional policy are formed and actions 
may be taken on the basis of evidence which this analysis suggests is not solid. Surely it is incum- 
bent on statistical agencies to make users aware of the perceived inadequacies of the data in 
order to prevent the generalization of their misuse. 


5. A PROGRAMME TO IMPROVE FOREIGN TRADE STATISTICS 


In addition to providing users with more factual information about error in foreign trade 
statistics, a programme or programmes to improve the quality of these statistics over time 
should be formulated. The following are steps which should probably have been taken some 
time ago: 

i) the c.i.f. component of imports should be measured systematically. Without it, it will not 
be possible to compare exports with imports across the board. The information is available 
at the time the import is reported to Customs. Matters such as how often and to which detail 
will depend on resources and on the urgency to improve the knowledge of users; 


ii) an inquiry should be launched into time lags between exports and imports by commodity 
category and by country of origin. To make such a study effective, it is probably necessary 
to count on the co-operation of partner countries; although, if this is not forthcoming, 
reference to commercial invoices may be an acceptable surrogate; 


iii) on the basis of knowledge of these two elements, a formal method to estimate counterpart 
imports on the basis of exports should be used and the error of estimate tabulated for future 
study. If the error of estimate has no significant autocorrelation properties, coding and 
related errors might explain the difference between the recorded import and its statistical 
estimate. If, however, the error term does not satisfy these criteria, it should be marked 
down for future inquiry in co-operation with the partner country; 


iv) obvious surpluses or deficits should be tested against countries likely to play the role of 
commercial intermediary or entrepôt. For example, an export surplus with the Netherlands 
for the United States should be tested against corresponding deficits with such countries 
as the Federal Republic of Germany or France. Econometric methods can be used to 
disentangle an across-the-board effect of entrepôt services (although they are more likely to be 
used for bulky and warehousable merchandise) from short-lived effects such as coding error; 


v) for those commodities which are systematic outliers, after all adjustments have been made, 
either because they persist over time or because they occur across countries, advantage 
should be taken of the Harmonized System by enlisting the help of the Customs Cooperation 
Council for the interpretation of its explanatory notes. 
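
Step iii) above can be illustrated with a small sketch. It is not a method given in the paper: it assumes the simplest possible estimator, in which counterpart imports equal the exports shipped a fixed number of periods earlier marked up by a constant c.i.f. rate, and it uses the lag-one autocorrelation of the errors of estimate as the diagnostic. All names and figures are hypothetical.

```python
def estimate_imports(exports, cif_rate, lag):
    """Estimated counterpart imports for periods lag, lag+1, ...:
    exports shipped `lag` periods earlier, marked up by the c.i.f. rate."""
    return [exports[t - lag] * (1.0 + cif_rate) for t in range(lag, len(exports))]


def lag1_autocorrelation(errors):
    """Lag-one autocorrelation of the errors of estimate."""
    n = len(errors)
    mean = sum(errors) / n
    dev = [e - mean for e in errors]
    denom = sum(d * d for d in dev)
    if denom == 0.0:
        return 0.0
    return sum(dev[t] * dev[t - 1] for t in range(1, n)) / denom


# Hypothetical series (illustration only): six periods of exports and
# the recorded counterpart imports for periods 1 to 5.
exports = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0]
recorded_imports = [107.0, 113.5, 128.0, 134.5, 149.0]

estimated = estimate_imports(exports, cif_rate=0.05, lag=1)
errors = [rec - est for rec, est in zip(recorded_imports, estimated)]
rho = lag1_autocorrelation(errors)
print(round(rho, 3))
```

Whether a given value of the autocorrelation counts as significant is left to the analyst; errors that show a persistent pattern would be set aside for inquiry with the partner country, as the step proposes.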


Obviously the launching of such a programme requires preparation, approval, and resources. 
It cannot take place at once nor will it be sponsored by most countries straight away. But the 
proposals ought not to be shelved as similar proposals were some thirteen or fourteen years 
ago. There is too much attention paid to the trade statistics to risk delaying their improvement. 
Their comparison with counterpart data shows that they can only stand increased attention 
if they are substantially improved or if their analysts become more aware of the limitations 
of the material on which they test their hypotheses. 


ABSTRACT 


Most surveys have many purposes and a hierarchy of six levels is proposed here. Yet most theory and 
textbooks are based on unipurpose theory, in order to avoid the complexity and conflicts of multipur- 
pose designs. Ten areas of conflict between purposes are shown, then problems and solutions are advanced 
for each. Compromises and joint solutions fortunately are feasible, because most optima are very flat; 
also because most ‘‘requirements’’ for precision are actually very flexible. To state and to face the many 
purposes are preferable to the common practice of hiding behind some artificially picked single purpose; 
and they have also become more feasible with modern computers. 


KEY WORDS: Allocations to domains; Mean-Square-Errors; Multipurpose allocation; Multipurpose 
design; Optimal allocation; Periodic samples; Sample size. 


1. INTRODUCTION 


Most studies involve several purposes during the planning stages and then typically many 
more purposes emerge later during the analyses of data and more during their interpretation 
and utilization. However, the real multipurpose nature of most studies tends to remain hidden 
under the surface of oversimplified, univariate discussions of study designs. This seems most 
clearly evident for sample surveys, which I shall discuss here; but I believe that this discrepancy 
also holds for other statistical designs, such as experimental and evaluation studies. 

In practice, surveys are usually multipurpose. Why then are multipurpose designs neglected 
in sampling theory? Because multipurpose theory would be too complex and difficult, and 
sampling theory is rather complex already; specific exceptions will be noted later. Even the 
descriptions we read of actual sample designs tend to follow and to borrow the prestige of 
univariate and unipurpose sampling theory, rather than to portray faithfully the many 
compromises of complex reality. Although many common designs (especially the equal probability 
of selection method) probably serve a variety of purposes robustly, explicit planning of 
multipurpose designs seems to be rare, though much needed, I propose. 

There are several aspects to the multipurpose nature of survey samples, and these are displayed 
in a hierarchy of six levels in Section 2. Then ten areas of conflict between purposes are specified 
in Section 3. Sections 4 to 9 deal with specific areas of conflict, presenting approaches to and solu- 
tions for them. Some of these solutions are attributed to widely dispersed articles of survey sampling; 
but others are more novel, hence less fully developed, derived, and referenced. 

In this overall review I aim first and foremost to serve practitioners with handy references 
on approaches, methods and procedures for multipurpose designs; to alert them both to the 
importance and to the feasibility of such designs. Second, I also wish to provide a framework 
for integrated, theoretical future work on the many problems and conflicts of multipurpose 
designs. Imperfections of my methods can serve as stimuli to others for better derivations for 
them, as well as for developing new methods. 


1 Keynote address at the International Symposium on Statistics, Taipei, Taiwan, August 1986; and also at a seminar 
of Statistics Canada, October 7, 1987. 
2 Leslie Kish, Institute for Social Research, The University of Michigan, Ann Arbor, MI 48104 USA. 


2. A HIERARCHY FOR LEVELS OF PURPOSES 


To begin with, we need some clarification of the meaning of ‘‘multipurpose’’, because too 
many concepts are confused under this term in our literature. To reduce the confusion, I classified 
a score of purposes into six levels in Table 1. Most of the time either multiple variables or 
multi-subject surveys (levels 3 or 4 in Table 1) are discussed and ‘‘multi-subject’’ (4) has sometimes 
been distinguished from multipurpose (3) for the same or closely related variables (Murthy 1967). 
Each of these six levels is shown in several specific manifestations, which can be usefully 
augmented and discussed in more detail elsewhere (e.g., United Nations 1980; Lahiri 1963). 

Integrated survey operations on level 5 are related to, but should be distinguished from multi- 
subject surveys, because they refer to organizations and institutions that conduct many surveys 
in diverse fields over longer periods of time (United Nations 1980; Foreman 1983). An earlier 
name was ‘‘continuing survey operations’’, when it was recognized that most large-scale, 
widespread sample surveys were conducted by continuing survey organizations like the U.S. Census 
Bureau, Statistics Canada, or our Survey Research Center. Such continuity has large advantages 
in costs and quality, with restraining effects on sample designs (Kish 1965). 

Master frames or master samples on level 6 refer to further extensions and specializations 
of multipurpose approaches. They may refer simply to using the same maps, or block listings, 
or area segments for several different surveys; or to the large-scale example of the ‘‘Master Sample 
of Agriculture’’ (King and Jessen 1945), where rural areas on the maps of all the counties of 
the USA were divided into segments of about four farms each; or to the firm that sells current 
listings of dwellings for most samples used in Western Germany. These very diverse examples 
have common bases in the savings from sharing the ‘‘startup’’ costs (of design, stratification, 
listing, etc.) for constructing sampling frames. 

Diverse statistics based on single variables and diverse domains (levels 1 and 2) have typically 
been neglected in the literature of multipurpose sampling, although they are the most common; 
yet they can have the most drastic effects and cause the most dramatic conflicts, as we shall see later. 
The effects of designs can be very different for statistics like medians and quantiles or regression 
coefficients from the effects for means and for aggregates (Kish 1961; Kish 1965; Kish and Frankel 
1974). Furthermore, designing for periodic samples brings on new considerations (Section 8). But 
the most dramatic effects can be seen simply for the means of small ‘‘subclasses’’ (e.g., as small as 
0.10 or 0.01) of the entire sample, representing similar ‘‘domains’’ in the population (Section 5). 

Each of the six levels of purposes presents different aspects for designs and each level can be 
fruitfully explored for more specific meanings and examples, some of which are listed in Table 1. 

The difficulties of multipurpose designs, which have caused them to be neglected and avoided, 
are of several kinds. First, the different purposes must be formulated explicitly in statistical terms, 
so that these may serve in formulas for their comparisons and for formulated compromises; but 
obtaining a (complete) list of such explicit, formal terms may be the principal obstacle. Second, 
estimates of variance and cost factors are needed for each purpose. Third, for some methods values 
must be assigned to the ‘‘required’’ precisions for all the purposes (Section 5). 
Fourth, the above values and estimates must be combined in a mathematical formulation in order 
to arrive at the solution of a single ‘‘optimal’’ design to be actually used. The computational tasks 
for such solutions have been eased by electronic computers, but the conceptual and theoretical 
tasks remain (Section 5). 

The difficulties of these tasks help to explain why discussions of multipurpose designs have 
been largely neglected in textbooks. However, note later references and bibliography here 
and in Rodriquez-Vera (1982); also Cochran (1977), and Chatterjee (1967). Furthermore, in 
descriptions of actual surveys, often a single statistic (e.g. the mean) of a single principal variable 
is presented as the only (principal) purpose for the study. In the framework of multipurpose design 


Table 1


Hierarchy of Purposes 


1. Diverse statistics from the same variables
- Totals or means or medians and quantiles, distributions
- Analytical statistics: regressions, categorical analysis
- Time aspects: static, macro-change, micro-change, cumulative


2. Diverse populations and domains (subclasses)
- Proper classes and crossclasses
- Comparisons of subclasses


3. Multiple variables on the same subject
- Alternative measures of one variable;
  e.g. of income, or unemployment
- Diverse periods: per day, week, month, year
- Several aspects of one subject: income, savings, wealth


4. Multisubject surveys
- Several subjects on same schedule, interview, operation
- Health surveys of many diseases
- Market research for several clients, many goods
- Agricultural surveys of many crops
- ‘‘Omnibus’’ social surveys


5. Continuing, integrated survey operations
- NSS in India, CPS in USA, NHSCP of UN
- Separate surveys from one office and field staff
- Common source of surveys
- Diverse methods, costs, operations, allocations, respondents


6. Master frames
- Several samples from one frame or set of listings
- Separate institutions, organizations
- Separate field staffs? Same PSU’s?


this is equivalent to assigning zero importance to all other purposes. The unreality of this
pretense may be softened by assuming that other principal purposes would result in similar 
allocations; but this pretense should be buttressed with calculations of the four steps above. 


3. AN OVERALL VIEW OF TEN AREAS OF CONFLICT 


A brief overall view of ten areas of conflict, listed in Table 2, should be useful before we look 
at specific problems and possible solutions for each. The list will probably not prove exhaustive, 
and readers may well find other areas. Even more likely, they may find within these ten areas other 
problems and other solutions not explored here. It would be convenient if the ten areas of conflict
could be linked rationally to the twenty purposes presented in six levels; we could then reduce
this presentation to, say, twenty purpose/conflict nodes or to ten level/conflict nodes. Unfortunately,
the areas of conflict denote a dimension perpendicular to the purposes, and all (or most) of the
10 x 6 cells have meaningful contents.

Of this long list of ten areas of conflict, fortunately not all need to be formulated for every actual
sample design. I believe that possible conflicts about a) the sample sizes $m_g$, and about b) the relation
of biases to sampling errors should always be considered, at least informally, because they
are ubiquitous. Also c) allocation among domains and d) allocation among strata should receive
at least a brief discussion, and often more. Computing sampling errors (j) should also be done
on most surveys. However, in the common case of one-time surveys, conflicts i) about design over
time need not be considered. On the other hand, in a continuing operation with a continuing
sampling frame, the decisions about e), f), g), and h) (stratification, cluster sizes and measures)
may have been made a long time ago for a fixed design. However, the cluster sizes (f) used in
intermediate stages (blocks and segments) may be open to flexible operational changes.

It is also reassuring to know that compromises based on statistical methods can yield quite 
acceptable results, for several reasons (Sections 5-8). First, because moderate departures from 
optimal allocation result in only small or negligible increases of variance. Curves of efficiency tend 
to be flat within broad areas around the optimal points; thus great accuracy for separate designs,
which would not be feasible, is not needed. Second, because wide departures from optimal
allocations can, on the other hand, cause moderate to large increases in variances. Thus, ignoring 
important purposes can result in substantial losses of efficiency for them, and therefore those 
purposes should be included in compromise designs. Third, compromise designs, in accord with 
statistical methods, can reduce drastically the potentially large losses from allocations optimized 
for other purposes, and with only small increases over the separate optimal designs for each purpose 
(Section 5). 


4. SAMPLE SIZES AND BIAS RATIOS (B/σ)


These two areas of conflict, a and b in Table 2, should perhaps be considered most important
overall, because they can be most dramatic. We treat them together here only because they may
be closely related through the effects of subclasses. Let us begin with the familiar (simple random
sampling with replacement) sample size $m = S^2/V$ needed to yield a ‘‘required’’ precision $V$
for a sample mean $\bar{y}$, with element variance $S^2$. However, the $S_g^2$ depend greatly on the
variables and on the domains, indexed jointly with $g$ for the means $\bar{y}_g$; and the ‘‘required’’
$V_g$ may vary even more. We also include design effects $D_g^2$ that also vary, and thus
$m_g = S_g^2 D_g^2 / V_g$ expresses the sample size needed for the mean of the variable $g$. For the mean
$\bar{y}_g$ of a domain $g$, comprising only the proportion $P_g$ in the population, the overall sample size
needed for the domain becomes $n_g = m_g/P_g$, and it is more practical to formulate the needed
sampling fraction $f_g = n_g/N = S_g^2 D_g^2 / (V_g P_g N)$. The factor $(1-f)$ may be neglected or included in
$D_g^2$. The $P_g$ become small and critical if high precisions are ‘‘required’’ for small subclasses.

For comparisons of subclasses the variances increase even more:
$m_{bc} = (m_b^{-1} + m_c^{-1})^{-1} = n(P_b^{-1} + P_c^{-1})^{-1}$,
with the $P_b$ and $P_c$ denoting proportions in the sample $n$ (assuming $S_b^2 = S_c^2$).
E.g., for the comparison of two subclass means of $0.01n$ and $0.10n$, we have the ‘‘effective size’’
$m_{bc} = n(0.01^{-1} + 0.10^{-1})^{-1} = n/110$. For other statistics, such as medians and regression
coefficients, formulating ‘‘required’’ sample sizes would become complex. It is more than we may
discuss here, but some numbers may probably be specified.
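
As a rough numerical check of the two formulas above, the following Python sketch computes the overall sample size needed for a domain mean and the effective size for a comparison of two subclasses. The function names and the input values (element variance, design effect, required variance, domain share) are assumptions chosen for illustration; only the 0.01n versus 0.10n comparison comes from the text.

```python
def needed_sample_size(S2, D2, V, P=1.0):
    """Overall sample size n_g = S2 * D2 / (V * P) needed so that a domain
    comprising proportion P of the population attains variance V for its mean."""
    m = S2 * D2 / V          # elements needed within the domain itself
    return m / P             # overall sample needed to yield m domain members


def effective_size_for_comparison(n, P_b, P_c):
    """Effective size (1/(n*P_b) + 1/(n*P_c))**-1 for the difference of two
    subclass means, assuming equal element variances in both subclasses."""
    return n / (1.0 / P_b + 1.0 / P_c)


# Hypothetical inputs: element variance 0.25, design effect 1.5,
# required variance 0.0004, domain share 10 percent.
n_total = needed_sample_size(S2=0.25, D2=1.5, V=0.0004, P=0.10)

# The example from the text: subclasses of 0.01n and 0.10n give n/110.
m_eff = effective_size_for_comparison(n=11000, P_b=0.01, P_c=0.10)
print(n_total, m_eff)
```

Under these assumed inputs a sample of 11,000 carries the information of only about 100 elements for the comparison, which is why small subclasses tend to drive the required total size.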

Considerations for subclass statistics become greatly modified if, in addition to variances
$\sigma^2$, we also include biases $B^2$ in the Root-Mean-Square-Error = RMSE = $\sqrt{\sigma^2 + B^2}$
for measures of accuracy. Figure 1 is meant to portray a common tendency in the accuracy of
survey data, although great differences in the relations of biases to sampling errors are possible;
reading the legend is urged here. It occurs commonly that potential biases $B_1$ are greater than the
measurable sampling and variable errors $\sigma_1$, for the entire sample. However, on the horizontal
axis the standard error $\sigma_1$ is shown to increase by a factor of about 3 for $\sigma_2$ of a subclass of about
1/10 of the total sample. For comparisons (differences) of two such subclasses $\sigma_3$ increases by
about 1.4 more.
Table 2
Ten Areas of Conflicts (a-j)


a.(4)¹ Sizes $m_g$ or rates $f_g$ are needed for purposes $g$

      $V_g = S_g^2 D_g^2 / m_g$ and $m_g = S_g^2 D_g^2 / V_g$ or $f_g = S_g^2 D_g^2 / (V_g P_g N)$

      where $m_g$ denote subclass sizes and $f_g = n_g/N = m_g/(P_g N)$ denote sampling rates


b.(4) Relation of biases to sampling errors in RMSE = $\sqrt{\sigma^2 + B^2}$
      - The bias ratio $B/\sigma$ decreases as $\sigma$ increases for subclasses
      - For comparisons $B/\sigma$ tends to be small as $B$ decreases, $\sigma$ increases


c.(5) Allocation of the $m_g$ among domains

      $m = \sum_g m_g$


d.(6) Allocation of $m_{gh}$ among strata $h$

      $m_g = \sum_h m_{gh}$


e.(6) Choice of variables for stratification
      Multivariate stratification


f.(7) Optimal cluster sizes

      $D_g^2 = [1 + \rho_g(b_g - 1)]$; $b_g = P_g n_t/a$ for crossclasses


g.(7) Measures for cluster sizes


h.(7) Retaining sampling units (PSU’s) for changed subjects, measures and strata and
      for diverse subjects.


i.(8) Design over time
      How much overlap? Panels? Change versus cumulation.


j.(9) Computing and presenting sampling errors.


¹ The numbers (4) to (9) refer to sections with treatments.


However, the hypotenuses denoting the RMSE are shown to increase much less. In RMSE$_1$,
the bias $B_1$ is shown to dominate, and this may happen for some variables in large total samples.
However, the subclass RMSE$_2$, because the bias was kept constant at $B_2 = B_1$, increased only
moderately and is dominated by $\sigma_2$. This is even more true for RMSE$_3$, where the $\sigma_3$ has increased,
but the biases (assumed to have the same sign, because that is a common tendency) decrease
$B_3$ in the difference of means.

Examples of these phenomena abound everywhere and for all the purposes listed in Table 1.
We choose the best known, critical statistics of unemployment, where admitted measurement biases
may completely swamp the low values (e.g., 0.1 percent) of measurable fluctuations. However,
for small subclasses (e.g. Black teenage boys) the sampling errors for small sample bases overtake
the biases. For periodic comparisons the sampling variations become even more critical.

These relations among biases and variable errors assumed here are not logically necessary, 
but empirical and common. Neglect of these simple relations leads to a great deal of confusion 
concerning the need for sample surveys of adequate precision, i.e. with small sampling errors,
$\sigma$. I propose Figure 1 as practical answers to some common questions, such as: Why do we spend


Figure 1. Variable errors ($\sigma$) and biases ($B$) in root mean square errors (RMSE)


The bases represent sampling errors and other variable errors ($\sigma$). For example $\sigma_1$ may be the ste($\bar{y}$) for the mean
$\bar{y}$ of the entire sample and $\sigma_2$ may be a larger ste($\bar{y}_c$) for a subclass mean, and $\sigma_3$ may be the ste($\bar{y}_c - \bar{y}_b$) for the
difference between two subclass means.

The heights represent biases ($B$) and the hypotenuse denotes the RMSE = $\sqrt{\sigma^2 + B^2}$. (1) For the entire sample
the bias $B_1$ may be large compared with the variable error $\sigma_1$, thus taking larger samples would not decrease the RMSE$_1$
by much. (2) However with the same bias $B_1$, but with a smaller sample in the subclass, the ratio changes and the $\sigma_2$
dominates the RMSE$_2$; and this is not much larger than for (1) despite a much smaller sample. (3) Furthermore, for the
difference of means, the net bias $B_3$ may be much smaller; so that even with a larger $\sigma_3$, the RMSE$_3$ for the difference
is but little greater than RMSE$_2$. This drastic change in the bias ratio $B/\sigma$ tends to appear not only for differences between
subclasses within the same sample, but also for differences between repeat surveys.


money for large samples and on rigorous sampling methods in the face of large measurement biases? 
Why bother computing sampling errors when response biases dominate the total error? The implicit 
answers come from the domination of sampling errors in the subclasses, and even more in their com- 
parisons. Let us make these implicit answers more explicit in future sample designs. 
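
The pattern that Figure 1 portrays can be sketched numerically. In the Python fragment below every input is a hypothetical value chosen only to mimic the figure: a bias three times the standard error for the entire sample, a subclass of one tenth of the sample with the same bias, and a difference of two such subclass means in which most of the bias is assumed to cancel.

```python
import math


def rmse(sigma, bias):
    """Root mean square error combining variable error and bias."""
    return math.sqrt(sigma ** 2 + bias ** 2)


# Hypothetical values, chosen only to mimic the pattern of Figure 1.
sigma_1, bias_1 = 1.0, 3.0                 # entire sample: bias dominates
sigma_2 = sigma_1 * math.sqrt(10)          # subclass of about 1/10 of the sample
bias_2 = bias_1                            # bias assumed unchanged in the subclass
sigma_3 = sigma_2 * math.sqrt(2)           # difference of two such subclass means
bias_3 = 0.5                               # biases of the same sign largely cancel

for label, s, b in [("total", sigma_1, bias_1),
                    ("subclass", sigma_2, bias_2),
                    ("difference", sigma_3, bias_3)]:
    print(f"{label:10s} sigma={s:5.2f} B/sigma={b / s:4.2f} RMSE={rmse(s, b):5.2f}")
```

With these assumed numbers the standard error roughly triples from the entire sample to the subclass, while the RMSE rises by less than 40 percent, and the bias ratio falls from 3 to below 1 and then close to 0.1 for the difference.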


5. ALLOCATION AMONG DOMAINS 


This most important and frequent area of conflict has several aspects. First, consider the
allocation of total sample size (or effort or cost) among the domains that constitute a partition
of the total population. A common example is allocation among the several (5, 10, 20 or 50) provinces
or regions or states of a country; those domains typically have very unequal populations
$N_g$, with ranges of 1 to 100 perhaps in relative sizes, though they may cover roughly equal surface
areas. Often the question takes this form: Should the sample sizes $n_g$ be roughly equal; or
should the $n_g$ be proportional to the $N_g$, with constant sampling rates $f_g = f$? Equal $n_g$ tends to
yield roughly equal errors, ste($\bar{y}_g$), for the means. On the other hand, constant $f_g = f$ tends to
yield the lowest ste($\bar{y}_w$) for the overall mean $\bar{y}_w = \sum_g W_g \bar{y}_g$, because it yields lower errors for the
larger domains. This error may be lower than ‘‘required’’ for $\bar{y}_w$, especially in view of potential
biases (Figure 1), and may not justify large total sample sizes and costs. This is the contention of
proponents of equal sizes $n_g$ for provinces. However, increased sampling errors for $\bar{y}_w$ are also
suffered by most other subclasses, especially ‘‘crossclasses’’ like age, sex, socioeconomic classes,
etc. whose sizes tend to proportionality to the total. Those are common disadvantages of the highly
unequal $f_g = n_g/N_g$ for provinces that result from the equal $n_g$ values.

For example, in the Current Population Surveys of the USA, larger $f_g$ are assigned to the
smaller states. The resulting weighting increases the variances (for a fixed total cost) of the overall
means and also of ‘‘crossclasses’’, such as young men and women, and especially of Black teenage
boys and girls (with critically high unemployment rates). Similar conflicts between national and
provincial needs occur in all countries, because provinces have widely different populations. The
need for better provincial data, for fixed total cost, conflicts with greater precision for national
and for ‘‘crossclass’’ statistics.

To reduce the usual confusion, I distinguish ‘‘domains’’ to denote partitions of the population,
from ‘‘subclasses,’’ the corresponding partitions of the sample. Then I distinguish ‘‘design
domains’’ (and subclasses) to refer to partitions (like provinces and regions) that are contained
in strata defined by the sample design, from ‘‘crossclasses’’ (like age, sex, occupation, income,
etc.) that cut across the sample design, both clusters and strata, often almost randomly. The design
effects differ for these two types of subclasses (Kish 1961, 1980, 1987).

In addition, other sources of conflict may arise from domain differences other than their sizes:
in the distribution of variables, also in the variances $D_g^2 S_g^2$, and in the precisions; but we need not enter into
those complexities here. Beyond calling attention to the problems, we refer to two distinct technical
methods for the joint solution of the conflicts in allocation (the fourth step noted at the end
of Section 2). One approach uses iterative nonlinear programming in order to satisfy for minimal
cost the ‘‘required’’ precisions jointly for all stated purposes. These elegant solutions to diverse
problems exploit modern computers and have been published in many articles since 1963 (see
reviews and references in Bean and Burmeister 1978, Rodriquez-Vera 1982, Cochran 1977). The
‘‘required minimal’’ cost often turns out much too high, because the ‘‘required’’ precisions were
unrealistic. Then the solutions are drastically rescaled downwards. But such rescaling exposes the
false pretensions (in my view) of this elegant approach that depends on unrealistic ‘‘required’’ precisions.
Principally, I question the reality of ‘‘step functions’’ for ‘‘required’’ precisions that assign
a constant value to any variance below the required $V$ and zero value to variances above it.

A very different approach calls for some form of averaging between all the ‘‘optimal’’ (preferred)
allocations for various purposes, by minimizing the combined (weighted) variance either
for fixed cost or fixed sample size. Of course, if the resulting combined variance turns out to be
too high (or low), the solutions can be scaled up (or down) in total fixed cost or sample size. I prefer
this solution, which compromises between different allocations, each of which would optimize for
only one purpose (Yates 1981; Dalenius 1957). It involves assigning relative values of importance
$I_g$ to all the listed statistics and this may seem difficult (but an ‘‘ignorant’’ decision-maker can assign
equal $I_g$ to all of them). But the other two alternatives are more extreme and they are bound to
prove even more difficult: either to specify the ‘‘required’’ precisions of all statistics for the first
approach, which then assigns arbitrarily equal weights of importance to all of them; or to specify
one statistic for the total weight of one, and thus zero weights for all other statistics.

Furthermore, compromises for the average can be shown to be generally feasible and worthwhile,
because the allocations are insensitive to moderate changes of weights of importance (as
is often true in statistics). After all, changing the relative importance by ratios of e.g., 2 or 5 should
be less drastic than assigning the total weight 1 to one variable and 0 to all others, a process that
implies infinite ratios of importance.

First, denote with $\sum_i V_{gi}^2/n_i$ the variance attainable for a statistic $g$ with the allocations of sample
sizes $n_i$ for the $i$th component of variation. Then let
$1 + L_g(n) = (\sum_i V_{gi}^2/n_i)/V_g^2(\min) = \sum_i C_{gi}^2/n_i$
denote the ratio of increase (with the allocation $n_i$) in the variance of the $g$th statistic over its own
minimal variance, both for the same fixed $\sum n_i$. Thus $L_g(n)$ is the relative loss over the minimal
value of 1, and accepting the relative variances $C_{gi}^2/n_i$ as the functions to be minimized is a critical
decision; those functions seem to me more reasonable than any others that I can imagine for the
functions to be combined in (1) below. For example, I prefer them to the $V_{gi}^2$ which depend on
arbitrary units of measurement, which are removed by the $V_g^2(\min)$. But in rare cases we may be
faced with $V_g^2(\min) = 0$ (or very small) and this may make $C_{gi}^2$ wildly large and unstable; in these
Table 3
Loss functions (1 + L) for two populations (Kish 1976)

                                            (A)                                  (B)
                                (1 + L) for $W_1/W_2 = 4$         (1 + L) for 133 countries: 0.2 to 100 mm

Allocations $m_i$               $\sum W_i\bar{y}_i$  $\sum\bar{y}_i/2$  Joint    $\sum W_i\bar{y}_i$  $\sum\bar{y}_i/133$  Joint    Joint with weights

$\propto W_i$                   1        1.56     1.28     1        6.86     3.93
$\propto 1/H$                   1.36     1        1.18     3.34     1        2.17
$\propto \sqrt{W_i}$            1.08     1.125    1.102    1.35     1.54     1.44
$\propto \sqrt{W_i^2 + H^{-2}}$     1.116    1.080    1.098    1.31     1.28     1.295
$\propto \sqrt{0.5W_i^2 + H^{-2}}$                             1.47     1.17     (1.32)   1.27
$\propto \sqrt{2W_i^2 + H^{-2}}$                               1.20     1.44     (1.32)   1.28
$\propto \sqrt{4W_i^2 + H^{-2}}$                               1.12     1.66     (1.39)   1.23


In (A) there are two strata and domains ($W_1 = 0.8$ and $W_2 = 0.2$); note that the allocation $m_i \propto \sqrt{W_i}$ does almost
as well for the joint loss as the optimal.


In (B) we have the populations of 133 countries, ranging in size from 0.2 to over 100 millions, a range of 500 in relative
sizes. From this problem of allocation (for the World Fertility Survey) we omitted, for practical reasons, the four largest
countries and a few under 0.2 millions. Their inclusion would raise the variance of relative sizes, $W_i$, from 2.5 to 12, and
would make the results more dramatic. Note that the $\sqrt{W_i}$ allocation reduces losses quite well. Some compromise is
better than none. But the optimal allocation, $\sqrt{W_i^2 + H^{-2}}$, is considerably better. Different values of $I_1/I_2$ (= 1/2, 2/1
and 4/1) increase slightly the variance of the joint loss function with (1:1) weights; but they remain best for joint loss
functions with their own weights $I_1/I_2 : 1$.


Two examples in Table 3 illustrate the surprisingly good compromises between conflicting allocations yielded by the
method of weighted averaging: its results on the fourth row of Table 3 compare very favorably with the others. The
reasons for the excellent results come from the very broad flat surfaces for the optimal allocations, as discussed in Section 2
and shown elsewhere (Kish 1976; Kish 1987). For example, in Canada the 10 provinces vary seventy-fold from
smallest to largest population sizes, and thus resemble B in Table 3; they serve as a graphical illustration in Figure 2.
(See also Fellegi and Sunter 1974.)


cases assign arbitrary values to the $C_{gi}^2$ or to the $I_g$ below. These and the following, including
Table 3, are developed and discussed by Kish (1976).

Then with the weights $I_g$ assigned for relative importance of the $g$th statistic for any set of
allocations $n_i$ of the sample sizes,


$1 + L(n) = \sum_g I_g (1 + L_g(n)) = \sum_g I_g \sum_i C_{gi}^2/n_i = \sum_i (\sum_g I_g C_{gi}^2)/n_i = \sum_i Z_i^2/n_i.$ (1)


After changing the order of summation, we created the new variables $Z_i^2 = \sum_g I_g C_{gi}^2$. This
function may be minimized to give compromise solutions for fixed total cost $\sum c_i n_i$. For the conflict
between $n_g = n/H$ of equal sample sizes for domains versus $n_g = nW_g$ proportional
to domain sizes $W_g$, the optimal compromise allocations are found to be proportional to
$\sqrt{W_g^2 + H^{-2}}$, with equal values for $I_g$.
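
The loss function (1) can be evaluated directly for the two-domain case of panel (A) in Table 3. The Python sketch below is only an illustration under stated assumptions: the helper names `normalize` and `loss` are hypothetical, costs per element are taken as equal, and the two purposes (the weighted overall mean and the equal-weight mean of domain means) receive equal importance in the joint loss.

```python
import math


def normalize(values):
    """Scale allocations so that they sum to 1 (a fixed total sample size)."""
    total = sum(values)
    return [v / total for v in values]


def loss(z2, alloc):
    """1 + L: sum of z2_i / n_i for allocation n_i, relative to the minimum
    (sum of z_i)**2 reached when n_i is proportional to z_i."""
    n = normalize(alloc)
    attained = sum(z2_i / n_i for z2_i, n_i in zip(z2, n))
    minimum = sum(math.sqrt(z2_i) for z2_i in z2) ** 2
    return attained / minimum


W = [0.8, 0.2]                       # two domains, W_1 / W_2 = 4
H = len(W)
z2_weighted = [w ** 2 for w in W]    # purpose 1: overall mean, sum of W_i * ybar_i
z2_equal = [1 / H ** 2 for _ in W]   # purpose 2: equal-weight mean, sum of ybar_i / H

allocations = {
    "proportional": W,
    "equal": [1 / H] * H,
    "square root": [math.sqrt(w) for w in W],
    "compromise": [math.sqrt(w ** 2 + H ** -2) for w in W],
}

for name, alloc in allocations.items():
    l1, l2 = loss(z2_weighted, alloc), loss(z2_equal, alloc)
    print(f"{name:12s} {l1:6.3f} {l2:6.3f} joint={(l1 + l2) / 2:6.3f}")
```

Under these assumptions the proportional and equal allocations each lose heavily for the purpose they ignore (about 1.56 and 1.36), whereas the compromise allocation keeps both losses near 1.1, in line with the flatness of the loss surface that the text describes.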

An important example was provided by the (otherwise excellent) World Fertility Surveys, 
which used roughly equal sample sizes for small and large countries: actual sample sizes 
varied only within the range of 3 to 10 thousand and with no discernible correlation with 
population size. Consequently, there were two- or three-fold increases of variances in the 
continental averages of national surveys, their ‘‘main contributions to knowledge’’: 
Figure 2. Five Alternative Allocations of Sample Sizes $n_g$ of Fixed Total $\sum n_g$


The ten provinces of Canada illustrate graphically the usual conflicts from major domains with unequal
sizes, also the feasible successful compromises.


1 Allocation proportional to domain sizes, $n_g \propto W_g$, is diagonal.
2 Equal allocation, $n_g \propto 1/H$, is a horizontal.
  Divergences of the two allocations are large near the ends.
3 The square-root allocation, $n_g \propto \sqrt{W_g}$, yields compromises at both ends.
4 The ‘‘optimal’’ allocation $n_g \propto \sqrt{W_g^2 + 1/H^2}$ improves both ends, and especially with an
  appealing ‘‘floor’’ near the lower end.
5 A ‘‘weighted’’ optimal $n_g \propto \sqrt{.8W_g^2 + .2/H^2}$ improves the upper end considerably.


‘‘So far, the main contribution to knowledge has been to confirm the downward trend in
fertility that characterized much of Asia and Latin America in the 1970’s and to highlight 
the contrast with Africa where both fertility and the desire for large numbers of children remain 
high’’ (Macura and Cleland 1985). 


6. ALLOCATIONS TO STRATA AND CHOICE OF STRATIFIERS 


Domains and strata often get confused in discussions, but the two aspects should be kept distinct 
in practical work on designs. Domains refer to subpopulations for which separate estimates are 
sought, whereas strata are usually smaller partitions created for decreasing variances. For example, 
within provinces as domains more strata may be created to reduce province variances; but cross- 
domains like age, sex and economic status tend to straddle across the strata. Allocations of sample 
sizes to strata, though often not as crucial as allocations to domains, may be important in case 
of efficient disproportionate optimal allocations. The two methods of Section 5 for allocating
sample sizes to domains can also be applied to allocations to strata, although the aims differ. Some 
of the references on nonlinear programming refer to domains and others to strata, and some 
confuse the two. 

The presence of several survey variables and statistics among the purposes has clear implications
for using more stratifying variables. Different survey variables will tend to have diverse
optimal relations with the stratifiers; then it is best to use many stratifiers, even if each stratifier
is used with only few stratum divisions (categories). Multipurpose design is the best reason for
multivariate stratification (Kish and Anderson 1978). It may also best justify the need for ‘‘controlled
selection’’ methods. The choice of stratum boundaries, called ‘‘optimal stratification’’,
is a related topic, but of less importance in this condensed presentation.


7. CLUSTER SIZES; MEASURES OF SIZE; RETAINING UNITS 


In descriptions of sample designs we find sometimes that the design effect has been approximated
with $D_g^2 = [1 + \rho_g(b_t - 1)]$, where $\rho_g$ stands for a synthetic intraclass correlation of the
‘‘most important’’ variable $g$ and $b_t = n/a$, the average cluster size. This would yield the effective
element variance $S_g^2 D_g^2$ and the variance $S_g^2 D_g^2/n$ for the mean of the variable $g$. However,
we must question the contents of $n$ and of $b_t$. If our population consists of married women of
childbearing age, they may be only 10 percent of total persons and found in only 30 percent of
dwellings; and much fewer than that for some rare populations. This situation has been treated
in sampling for rare traits (Kish 1965). ‘‘Ordinarily we avoid large clusters, because of their adverse
effects on the variance. But even large clusters of the entire population will yield only small clusters
of a rare trait, if this is widely spread. For example, entire blocks may be sampled for persons
over 65 years of age; entire villages may be searched for persons with an identifiable disease. If,
on the contrary, the trait is concentrated in small areas, those areas often can be recognized and
stratified accordingly.’’

In multipurpose designs, the crossclasses of the sample will be of variable sizes that are portions
of the total sample size $n_t$, with $M_g$ as their different proportions in the populations. Thus we
want to estimate in the design not only $[1 + \rho_g(b_t - 1)]$ for diverse variables $g$ for the total
sample $n_t$, but also $[1 + \rho_g(b_g - 1)]$ for many crossclasses. Here, as in Section 6, the index $g$
is made to serve both variables and subclasses, in order to simplify notation. Then we make use
of some conjectures that have been shown to be good approximations in thousands of empirical
computations for scores of samples:


$[1 + \rho_g(b_g - 1)] = [1 + \rho_g(M_g b_t - 1)] = [1 + \rho_t(M_g b_t - 1)]$ (2)


That is, we use $b_g = M_g b_t$ and $\rho_g = \rho_t$ as rough approximations. True that this somewhat
underestimates the average values of $D_g^2$ for crossclasses, because of variations in cluster sizes
of crossclasses. But that is a small factor compared to the large variations of $\rho_g$ between variables
(Kish 1987; Verma et al. 1980; Kish et al. 1976), and that underestimate has small effects on the
efficiency of designs. It is important to consider efficiencies of estimates for subclasses as well
as for the entire sample; these considerations point to considerably higher efficiencies for larger
clusters than would be shown for $b_t$ and $n_t$ for the total sample only.
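
The approximation in (2) is easy to explore numerically. The short Python sketch below is illustrative only: the function names, the intraclass correlation of 0.05, the cluster sizes of 10 and 40, and the crossclass share of 10 percent are all assumed values, not figures from any survey.

```python
def design_effect(rho, b):
    """Approximate design effect 1 + rho * (b - 1) for average cluster size b."""
    return 1.0 + rho * (b - 1.0)


def crossclass_design_effect(rho, b_total, M):
    """Design effect for a crossclass comprising proportion M of the sample,
    using the rough approximations b_g = M * b_total and unchanged rho."""
    return design_effect(rho, M * b_total)


rho = 0.05  # hypothetical synthetic intraclass correlation
for b_total in (10, 40):
    total = design_effect(rho, b_total)
    sub = crossclass_design_effect(rho, b_total, M=0.10)
    print(f"b_t={b_total:3d} total deff={total:4.2f} crossclass deff={sub:4.2f}")
```

With these assumed values, a fourfold increase in cluster size roughly doubles the design effect for the total sample (1.45 to 2.95) but raises it for the 10 percent crossclass only from 1.00 to 1.15, which is the sense in which larger clusters look more efficient once subclasses are considered.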

Measures of size are related to cluster sizes, but differ because of errors in the available measures, 
due especially to different population contents and to obsolescence. We must also note problems 
concerning measures of size for multisubject surveys and for ‘‘integrated survey operations’’ for 
different populations, which may especially need drastic compromises. Those two levels of
purposes (Table 1) should be distinguished because multisubject surveys use single samples in 
one operation; but integrated survey operations may use different sizes of sampling units for 
different surveys (United Nations 1980). For example, consider integrated designs for total 
populations and for agriculture; also perhaps for ethnic subpopulations; also perhaps for indus- 
trial or business activities: the measures of size for each of these may differ greatly. Yet some 
compromise solution may be found to yield reasonable efficiencies for each. 

Measures of size are also closely related to problems for ‘‘Retaining units after changing strata
and probabilities’’ (Kish and Scott 1971). Those methods were designed to deal with changes
over time of sampling units, both in measures of size and in stratifying variables; but the methods 
are also relevant for differences between survey variables: 

‘‘Unequal selection probabilities are often assigned to sampling units. Our methods, though 
more generally applicable, are especially needed for the selection of primary sampling units for 
surveys. Often these are selected separately from many strata, with one selection from each stratum. 

‘‘After the initial selection the units may be used for many surveys over several years. But
as time passes, the needs of new surveys may be better served by new strata and new selection 
probabilities, based on new data, than by those used for the initial selection. The difference 
between initial and new data may be due to differential changes among the sampling units as 
revealed by the latest Census. Or the differences may be due to changes in survey objectives and 
populations; for example, a sample initially designed for households and persons may later be 
required to serve a survey of farmers, or college students. Obviously our methods are also 
applicable to designing simultaneously a related group of samples with differing objectives.’’

This method allows for using the best measures (for size and for strata) separately for each 
sample purpose, but maximizing the retention of the overlap of sampling units between the samples 
for separate purposes (especially PSU’s). However, it would be possible to design a compromise 
that would average the measures in order to achieve a complete overlap of units, but sacrificing 
some efficiency for each of the purposes. A compromise between the two techniques may be even 
better than either: increase the overlap with small sacrifices of separate efficiencies by recognizing 
only differences of measures that surpass some arbitrary minimal criteria (Kish and Scott 1971). 


8. PURPOSES AND DESIGNS FOR PERIODIC STUDIES 


Periodic studies provide areas of conflict with great and growing importance as their numbers 
and sizes increase. It is wrong to assume that those expensive and influential surveys have only 
one of the five purposes listed in Table 4, because usually they are needed for several or all, if 
the design permits their use. 

In Table 4 we note five purposes and six designs. The first four are paired with similar letters 
on the same four lines. These pairings call attention to designs that best serve, with reduced 
variances, each of the four purposes. Most periodic studies have several purposes and thus we 
should face, and perhaps solve, the difficult problems of multipurpose designs. Actually cur- 
rent levels (A) and net changes (C) can be served with any of the six listed designs, but with some 
increase in variances or in costs. However, individual (gross, micro) changes (D) need panels; 
and cumulations (B) need some changes of samples, and are fastest without any overlaps. For 
current levels (A) variances can be somewhat reduced with estimators using correlations from 
partial overlaps. Net changes (C) benefit from correlations from any overlap, and most from 
complete overlaps (Cochran 1977; Kish 1987; Kish 1965). Reasonable compromises often become 
possible, when purposes can be defined. However, extraneous considerations may rule out some 
designs (e.g., overlaps may be either prohibited or enforced) and thus force the use of less effi- 
cient — but still valid — designs. 
Table 4
Purposes and Designs for Periodic Samples

Purposes                        Designs                            Rotation scheme

A. Current levels               A. Partial overlaps 0 < P < 1      abc-cde-efg
B. Cumulations                  B. Nonoverlaps P = 0               aaa-bbb-ccc
C. Net changes (means)          C. Complete overlaps P = 1         aaa-aaa-aaa
D. Gross changes (individual)   D. Panels                          same elements
E. Multipurpose time series     E. Combinations, SPD
                                F. Master Frames


The chief variation in these six designs concerns the amount (and kind) of overlaps between 
periods. The rotation scheme of complete overlaps shows, with aaa-aaa, that the periods have 
all common parts; the nonoverlap with aaa-bbb shows none; and the partial overlap abc-cde- 
efg shows c and e as 1/3 overlaps between succeeding periods only. This section concentrates 
on the effects of varying proportions of overlaps P in diverse designs on different purposes; in 
complete overlaps P = 1, in nonoverlaps P = 0, and in partial overlaps 0 < P < 1. The pur- 
poses are discussed in terms of variances for estimated means, because means (and percentages, 
rates, proportions) are both the most used and the simplest estimates. Effects on other estimates 
will not be entirely different but they are too many, diverse, and difficult to be explored here. 
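
The opposite effects of overlap on net changes and on cumulations can be sketched with the textbook variances for simple means. The formulas in the Python fragment below are an assumption added for illustration and are not given in this paper: they suppose two periods with equal sample sizes n, equal element variance S2, simple random selection, and a correlation R between periods for the overlapping proportion P of elements. The numerical inputs are hypothetical.

```python
def var_net_change(S2, n, P, R):
    """Variance of ybar_2 - ybar_1 when a proportion P of the sample overlaps
    and overlapping observations correlate R between periods."""
    return (2.0 * S2 / n) * (1.0 - P * R)


def var_cumulated_mean(S2, n, P, R):
    """Variance of the average (ybar_1 + ybar_2) / 2 under the same assumptions."""
    return (S2 / (2.0 * n)) * (1.0 + P * R)


S2, n, R = 1.0, 1000, 0.8   # hypothetical element variance, size and correlation
for P in (0.0, 1.0 / 3.0, 1.0):
    print(f"P={P:4.2f} change={var_net_change(S2, n, P, R):.5f} "
          f"cumulation={var_cumulated_mean(S2, n, P, R):.5f}")
```

Under these assumptions complete overlap cuts the variance of the net change to one fifth of its nonoverlap value, while it nearly doubles the variance of the cumulated mean; partial overlap falls between the two.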

More discussion of panels, with their advantages, disadvantages, problems and solutions, is
available elsewhere (Duncan and Kalton 1986; Kish 1987). I call attention to SPD, or Split
Panel Designs, which I am trying to promote for multipurpose designs. These would combine a
panel sample P with new rotating or "rolling" samples, so that Pa-Pb-Pc-Pd would symbolize
the periodic samples. The rolling samples a, b, c, d, etc., could be cumulated into larger samples.
The panel P serves primarily to provide micro (individual, gross) changes. But it also serves as
the partial overlap for better estimates of both current levels and macro (mean, net) changes
for any pair of periods.


9. COMPUTING AND PRESENTING SAMPLING ERRORS 


It seems questionable to include this topic under design, but I have no doubt that this is a
multipurpose problem. The strategies for computing and presenting sampling errors deserve
separate listing as an area of conflict among the many statistics given generally for the results of
surveys. It is not enough to present standard errors for only one or a few of the most important
statistics: they are too many and too diverse. Because of that diversity, the practice has grown
up to compute from the variances other expressions of sampling variability, especially estimates
of the "design effects" $d_g^2$; also sometimes, from $d_g^2 = 1 + \rho_g (b_g - 1)$, estimates of the
synthetic intraclass correlation $\rho_g$.

Briefly, I advise: a) Compute sampling errors for many variables, because the variances, the
design effects ($d_g^2$), and the intraclass coefficients ($\rho_g$) can and do differ greatly between
variables. b) You may have to do some averaging of sampling errors, because it may be
inconvenient or confusing to present them all. c) It may be neither feasible nor necessary to compute
sampling errors for all subclasses, because they can often be approximated with reasonable
models. d) It is necessary to present sampling errors for subclasses and for other statistics to
guide the readers of the reports (Kish 1965; Kish 1987; Verma et al. 1980). I hope that in the
future this topic will receive from theorists and methodologists some of the attention it needs.
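
The relation between a design effect and a synthetic intraclass correlation can be sketched in a
few lines. The example assumes the standard form deff = 1 + roh (b - 1), where b is the average
cluster size (Kish 1965); the function names are hypothetical and chosen only for illustration.

```python
def design_effect(roh: float, b: float) -> float:
    """Design effect implied by intraclass correlation roh and average cluster size b."""
    return 1.0 + roh * (b - 1.0)


def synthetic_roh(deff: float, b: float) -> float:
    """Synthetic intraclass correlation recovered from an estimated design effect."""
    if b <= 1.0:
        raise ValueError("average cluster size b must exceed 1")
    return (deff - 1.0) / (b - 1.0)


if __name__ == "__main__":
    # The same roh gives very different design effects for different cluster sizes,
    # one reason why sampling errors are worth computing for many variables and subclasses.
    for b in (5, 21, 41):
        print(b, design_effect(0.05, b))
```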


10. CONCLUSIONS


For the ten areas of conflict of Section 3 approaches and solutions are proposed in Sections 
4 to 9 that are very diverse. Averaging allocations among domains in Section 5 seems to give 
surprisingly good compromise solutions. The advice in Section 6 to use more stratifiers can also 
yield worthwhile gains. In Sections 4 and 7 considerations for subclass estimates lead to drastically 
different decisions for sample designs. In Section 8 we note how periodic designs can be best 
suited to purposes, and best compromise for multipurpose aims. We looked at the different levels 
of purposes and at the various areas of conflicts jointly. Asking the right question is the core 
of most problems. I propose multipurpose design as a new paradigm, to replace "optimal"
solutions to artificially partial questions such as: What is the optimal allocation for the mean y or
the total Y of "the most important" variable?


On the Stratification of Skewed Populations


PIERRE LAVALLÉE and MICHEL A. HIDIROGLOU¹


ABSTRACT 


For a given level of precision, Hidiroglou (1986) provided an algorithm for dividing the population into 
a take-all stratum and a take-some stratum so as to minimize the overall sample size assuming simple random 
sampling without replacement in the take-some stratum. Sethi (1963) provided an algorithm for optimum 
stratification of the population into a number of take-some strata. For the stratification of a highly skewed 
population, this article presents an iterative algorithm which has as objective the determination of stratifica- 
tion boundaries which split the population into a take-all stratum and a number of take-some strata. These 
boundaries are computed so as to minimize the resulting sample size given a level of relative precision, simple 
random sampling without replacement from the take-some strata and use of a power allocation among 
the take-some strata. The resulting algorithm is a combination of the procedures of Hidiroglou (1986) and 
Sethi (1963). 


KEY WORDS: Iterative algorithm; Optimum boundaries; Take-all; Take-some. 


1. INTRODUCTION 


Efficient sampling of highly skewed populations, such as those encountered in business surveys,
requires that they be stratified into a take-all stratum and a number of take-some strata. All
units in the take-all stratum are selected with certainty, whereas units in the take-some strata are
selected by a probability mechanism. Approximate cut-off rules for stratifying a population into
a take-all and a single take-some stratum have been given by Glasser (1962) and Hidiroglou (1986).
Glasser (1962) provided the cut-off value under the assumption that a fixed total sample size was
to be drawn from the take-all and take-some strata, and that the take-some sampled units were
to be selected without replacement using simple random sampling. Hidiroglou (1986) provided
the cut-off value under the assumption that a required level of precision had to be satisfied. These
two approaches are dual in the sense that Glasser's objective was to minimize sampling variance
for fixed sample size, whereas Hidiroglou's objective was to minimize sample size for fixed sampling
variance.

In this article, an algorithm for stratifying a highly skewed population into a take-all stratum
and a number of take-some strata is presented. The objective is to minimize the overall
sample size given the coefficient of variation of the estimator and the scheme for allocating the
sample to the take-some strata. The strata boundaries are derived in terms of an auxiliary variable
which is closely related to the information being collected by the survey. For example, if yearly
sales is one of the variables measured by a census of retailers, this auxiliary variable can be used to
determine the strata boundaries for a single-purpose survey which collects sales on a monthly
basis. For a multi-purpose survey, given that the strata boundaries have been determined using
an auxiliary variable closely related to the main variable, the optimality of these boundaries will
diminish for other variables which are not well correlated with it. The algorithm is a modification
of Sethi's (1963) method for stratifying a population. The resulting boundaries, which are optimal,
will provide the required minimum sample size.


¹ Pierre Lavallée is Methodologist and Michel A. Hidiroglou is Chief, Business Survey Methods Division, Statistics Canada,
Ottawa, Ontario K1A 0T6, Canada. The authors would like to acknowledge France Bilocq, Business Survey Methods

The allocation scheme chosen to illustrate the method is the power allocation.
The use of this type of allocation enables the publication of strata estimates which do not have
markedly different coefficients of variation. Power allocation has been proposed by Carrol (1970),
Fellegi (1981) and Bankier (1988). In practice it is found to offer a compromise between Neyman
allocation and the requirement to have equal coefficients of variation for each stratum. A
disadvantage of Neyman allocation is that, if estimates are required for each stratum, the associated
coefficients of variation may differ considerably between the strata. Alternatively, an allocation
which achieves equal coefficients of variation amongst the strata may require a sample size which
is much larger than the one required under Neyman allocation. In our context, power allocation
would enable the publication of estimates with similar coefficients of variation for strata of
companies of varying sizes (small, medium and large).

The method developed in the paper will be numerically compared, in terms of boundary values
and sample size, to the Dalenius-Hodges (1959) cumulative square root f rule, as well as to
a mixture of the Hidiroglou (1986) and the Dalenius-Hodges (1959) stratification methods. The
algorithm, which is recursive in nature, is simple to program and converges rapidly to the optimum
boundary points. It also offers substantial savings in terms of sample size for given reliability
criteria.


2. THE PROBLEM 


Consider a finite ordered population of N units:

$$ y_{(1)}, y_{(2)}, \ldots, y_{(N)}, $$

with $y_{(i)} \le y_{(i+1)}$ for $i = 1, 2, \ldots, N-1$. This population is to be stratified into L strata. The
number of units in each stratum is denoted by $N_h$, $h = 1, 2, \ldots, L$. The sampling scheme calls
for $n_h$ units to be drawn from each corresponding take-some stratum of size $N_h$ ($h = 1,
2, \ldots, L-1$) without replacement, using simple random sampling, with $n_L = N_L$. The mean
to be estimated is

$$ \bar{Y} = \sum_{h=1}^{L} \; \sum_{j=M_{h-1}+1}^{M_h} y_{(j)} / N, \qquad (2.1) $$

where $M_h = \sum_{i=1}^{h} N_i$, $h = 1, 2, \ldots, L$, and $M_0 = 0$.

Given this set up, the estimator of the population mean $\bar{Y}$ is

$$ \hat{\bar{Y}} = \sum_{h=1}^{L} \frac{N_h}{N} \sum_{i=1}^{n_h} \frac{y_{hi}}{n_h}, \qquad (2.2) $$

where $y_{(M_{h-1}+1)} \le y_{hi} \le y_{(M_h)}$ for $i = 1, 2, \ldots, n_h$, $n = \sum_{h=1}^{L} n_h$,
$h = 1, 2, \ldots, L$, and $M_0 = 0$.
Assume that the desired level of precision for the estimated mean is specified by c (coefficient
of variation) and that the proportion of sampled units to be allocated to each of the first L - 1 strata
is $a_h$ ($h = 1, 2, \ldots, L-1$), where $\sum_{h=1}^{L-1} a_h = 1$. The term "$a_h$" is conveniently used to
represent any type of allocation to the strata. For instance, in the case of N-proportional power
allocation,

$$ a_h = \frac{N_h^p}{\sum_{h=1}^{L-1} N_h^p}, $$

and in the case of Y-proportional power allocation,

$$ a_h = \frac{(N_h \bar{Y}_h)^p}{\sum_{h=1}^{L-1} (N_h \bar{Y}_h)^p}, $$

where $\bar{Y}_h$ is the mean of stratum h and $0 < p < \infty$. The power allocations have the property that
under relatively simple assumptions and for a suitable choice of p, the coefficients of variation
for the take-some strata tend to be equalized without a significant increase in the overall
coefficient of variation. This equality of coefficients of variation is often desired by the users of
the survey data.

In practice, the value of p is often chosen to be 1/2 or 1/3. A small value of p (i.e. p close 
to 0) usually yields similar stratum coefficients of variation while a larger value increases the 
discrepancy between the coefficients of variation but also increases the precision of the overall 
estimates. 
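
The effect of the power p on the allocation proportions can be seen in a small sketch. It assumes
the power allocation takes the form a_h = x_h^p / sum(x_h^p), where x_h is the stratum size (the
N-proportional case) or the stratum total (the Y-proportional case); the function name and the
example figures are hypothetical.

```python
def power_allocation(stratum_measures, p):
    """Allocation proportions a_h proportional to x_h ** p over the take-some strata.

    stratum_measures : positive size measures x_h (stratum counts or stratum totals)
    p                : power, p > 0; p = 1 gives allocation proportional to x_h
    """
    weights = [x ** p for x in stratum_measures]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    totals = [100.0, 400.0]          # illustrative stratum totals
    for p in (1.0, 0.5, 1.0 / 3.0):
        print(p, power_allocation(totals, p))
```

Lowering p moves the proportions toward equal shares, which is what pulls the stratum
coefficients of variation closer together at some cost in overall precision.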

It should be noted that these power allocations are equivalent to the allocation proposed by
Bankier (1988) when the population coefficients of variation of the take-some strata are equal.

The variance of $\hat{\bar{Y}}$ is

$$ V(\hat{\bar{Y}}) = \frac{1}{N^2} \sum_{h=1}^{L-1} \frac{N_h}{n_h} (N_h - n_h) S_h^2, \qquad (2.3) $$

where $S_h^2$ denotes the population variance of stratum h. In terms of the desired level of
coefficient of variation c, $V(\hat{\bar{Y}})$ may be reexpressed as $V(\hat{\bar{Y}}) = c^2 \bar{Y}^2$. Substituting
$n_h = (n - N_L) a_h$ and $V(\hat{\bar{Y}}) = c^2 \bar{Y}^2$ into (2.3) and solving for n gives

$$ n = N_L + \frac{\sum_{h=1}^{L-1} N_h^2 S_h^2 / a_h}{(N c \bar{Y})^2 + \sum_{h=1}^{L-1} N_h S_h^2}. \qquad (2.4) $$

The problem is to find boundaries $b_{(1)}, b_{(2)}, \ldots, b_{(L-1)}$ (where $y_{(1)} < b_{(1)} < \ldots
< b_{(L-1)} < y_{(N)}$) such that the overall sample size n is minimized, given the level of reliability
c and the specific allocation scheme (represented by $a_h$).
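
To make the objective concrete, the sketch below evaluates the sample size n = N_L +
[sum N_h^2 S_h^2 / a_h] / [(N c Ybar)^2 + sum N_h S_h^2] of (2.4) for a given split of a sorted
population, using the Y-proportional power allocation. The function name, the use of list
positions as cut points and the requirement of at least two units per take-some stratum are
assumptions made for the illustration.

```python
def required_sample_size(y, cut_points, c, p=1.0):
    """Sample size of (2.4) for positive values y split at positions cut_points.

    After sorting, stratum h holds y[cut[h-1]:cut[h]]; the last stratum is take-all.
    Each take-some stratum must contain at least two units.
    """
    y = sorted(y)
    n_pop = len(y)
    ybar = sum(y) / n_pop
    edges = [0] + list(cut_points) + [n_pop]
    strata = [y[a:b] for a, b in zip(edges[:-1], edges[1:])]
    take_some, take_all = strata[:-1], strata[-1]

    sizes = [len(s) for s in take_some]
    means = [sum(s) / len(s) for s in take_some]
    variances = [sum((v - m) ** 2 for v in s) / (len(s) - 1)
                 for s, m in zip(take_some, means)]

    # Y-proportional power allocation a_h over the take-some strata
    weights = [(nh * m) ** p for nh, m in zip(sizes, means)]
    alloc = [w / sum(weights) for w in weights]

    numer = sum(nh ** 2 * s2 / a for nh, s2, a in zip(sizes, variances, alloc))
    denom = (n_pop * c * ybar) ** 2 + sum(nh * s2 for nh, s2 in zip(sizes, variances))
    return len(take_all) + numer / denom


if __name__ == "__main__":
    values = [1.0, 3.0, 100.0]
    print(required_sample_size(values, cut_points=[2], c=0.1))
```

The result is generally not an integer; in practice it would be rounded up.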


3. THE ALGORITHM 


The approach used in this paper for obtaining stratification boundaries for a desired level of
precision was first used by Dalenius (1950) in the case of stratification boundaries for a given
sample size. It is first assumed that the sampling is done from a population whose frequency
distribution may, with sufficient accuracy, be represented by a continuous density f(y). Then, for
a given set of boundaries $b_{(1)}, \ldots, b_{(L-1)}$, the following quantities are defined:

$$ W_h = \int_{b_{(h-1)}}^{b_{(h)}} f(y)\, dy, \qquad (3.1) $$

$$ \mu_h = \int_{b_{(h-1)}}^{b_{(h)}} y f(y)\, dy \, / \, W_h, \qquad (3.2) $$

$$ \sigma_h^2 = \int_{b_{(h-1)}}^{b_{(h)}} y^2 f(y)\, dy \, / \, W_h - \mu_h^2, \qquad (3.3) $$

for $h = 1, 2, \ldots, L$, with $b_{(0)} = -\infty$ and $b_{(L)} = +\infty$.

Equation (2.4) can then be rewritten as

$$ n = N W_L + \frac{N \sum_{h=1}^{L-1} W_h^2 \sigma_h^2 / a_h}{N c^2 \mu^2 + \sum_{h=1}^{L-1} W_h \sigma_h^2}, \qquad (3.4) $$

where

$$ \mu = \int_{b_{(0)}}^{b_{(L)}} y f(y)\, dy. $$

It should be noted that even if the population is considered to be large, the finite population
correction (f.p.c.) factor is still present in equation (3.4); see Dalenius-Gurney (1951). By
definition, the take-all stratum needs to have a finite population in order to get a finite sample size.
Also, ignoring the f.p.c. would not lead to a zero variance for the take-all stratum.

The $a_h$ in equation (3.4) can also be represented using the quantities (3.1), (3.2) and (3.3). In
the case of the N-proportional power allocation, we get:

$$ a_h = \frac{W_h^p}{\sum_{h=1}^{L-1} W_h^p} \qquad (3.5) $$

for $h = 1, \ldots, L-1$. For the Y-proportional power allocation, the following is obtained:

$$ a_h = \frac{(W_h \mu_h)^p}{\sum_{h=1}^{L-1} (W_h \mu_h)^p}, \qquad (3.6) $$

where $0 < p < \infty$.

In this paper, the Y-proportional power allocation will mainly be considered, but the
calculations can also be performed for the N-proportional power allocation and, in fact, for any kind
of allocation represented by some $a_h$ where $\sum_{h=1}^{L-1} a_h = 1$. Putting equation (3.6) into (3.4), we
get

$$ n = N W_L + \frac{N \left[ \sum_{h=1}^{L-1} (W_h \sigma_h)^2 (W_h \mu_h)^{-p} \right] \left[ \sum_{h=1}^{L-1} (W_h \mu_h)^p \right]}{N c^2 \mu^2 + \sum_{h=1}^{L-1} W_h \sigma_h^2}. \qquad (3.7) $$

In order to find the optimal boundaries $b_{(1)}, \ldots, b_{(L-1)}$ such that the sample size n will
be minimum, the derivatives of equation (3.7) are taken with respect to $b_{(1)}, \ldots, b_{(L-1)}$,
respectively, and equated to zero. The resulting equations are, for $h = 1, 2, \ldots, L-2$,

$$ \begin{aligned}
& [F T_h - F T_{h+1}] \, b_{(h)}^2 \\
& + [F K_h - 2 \mu_h F T_h - F K_{h+1} + 2 \mu_{h+1} F T_{h+1} + 2 \mu_h A B - 2 \mu_{h+1} A B] \, b_{(h)} \\
& + (F T_h \mu_h^2 + F T_h \sigma_h^2 - F T_{h+1} \mu_{h+1}^2 - F T_{h+1} \sigma_{h+1}^2 - A B \mu_h^2 + A B \mu_{h+1}^2) = 0, \qquad (3.8)
\end{aligned} $$

and for $h = L-1$,

$$ \begin{aligned}
& [F T_{L-1} - A B] \, b_{(L-1)}^2 + [F K_{L-1} - 2 \mu_{L-1} F T_{L-1} + 2 \mu_{L-1} A B] \, b_{(L-1)} \\
& + (F T_{L-1} \mu_{L-1}^2 + F T_{L-1} \sigma_{L-1}^2 - A B \mu_{L-1}^2) = F^2, \qquad (3.9)
\end{aligned} $$

where

$$ A = \sum_{h=1}^{L-1} (W_h \mu_h)^p, $$

$$ B = \sum_{h=1}^{L-1} (W_h \sigma_h)^2 (W_h \mu_h)^{-p}, $$

$$ F = N c^2 \mu^2 + \sum_{h=1}^{L-1} W_h \sigma_h^2, $$

$$ K_h = B p \, (W_h \mu_h)^{p-1} - A p \, (W_h \sigma_h)^2 (W_h \mu_h)^{-p-1}, $$

$$ T_h = A W_h (W_h \mu_h)^{-p}. $$

Labeling the coefficient of $b_{(h)}^2$ as $\alpha_h$, the coefficient of $b_{(h)}$ as $\beta_h$, and the remaining terms as
$\gamma_h$ (with $F^2$ moved to the left-hand side when $h = L-1$), equations (3.8) and (3.9) can be
represented as quadratic equations of the form $\alpha_h b_{(h)}^2 + \beta_h b_{(h)} + \gamma_h = 0$. However, as pointed
out by Sethi (1963), the terms $\alpha_h$, $\beta_h$ and $\gamma_h$ are themselves functions of $b_{(1)}, \ldots, b_{(L-1)}$
through the integrals (3.1), (3.2) and (3.3). Using Sethi's (1963) approach, equations (3.8) and
(3.9) can easily be solved using the following iterative method:


STEP 1 : Start with some arbitrary boundaries $b_{(1)} < \ldots < b_{(L-1)}$.


STEP 2 : Calculate the proportions $W_h$, the means $\mu_h$, and the variances $\sigma_h^2$ (from equations
(3.1), (3.2) and (3.3), respectively) based on these boundaries, $h = 1, \ldots, L-1$.


STEP 3 : Replace the initial set of boundaries by $b'_{(1)}, \ldots, b'_{(L-1)}$, where

$$ b'_{(h)} = \frac{-\beta_h + \sqrt{\beta_h^2 - 4 \alpha_h \gamma_h}}{2 \alpha_h}, \qquad h = 1, \ldots, L-1. \qquad (3.10) $$


STEP 4 : Repeat steps 2 and 3 until two consecutive sets are either identical or differ by negligible
quantities, i.e.

$$ \max_{h = 1, \ldots, L-1} \left| b'_{(h)} - b_{(h)} \right| < \epsilon \quad \text{for some } \epsilon > 0. \qquad (3.11) $$


It should be noted that it can be proved that the sign before the square root is
positive because $b'_{(h)}$ lies between $\mu_h$ and $\mu_{h+1}$.

The difficulty of using the above algorithm is that some knowledge of f(y), the approximate
density, is required. Since the population considered is finite, it is possible to overcome this
difficulty by replacing the quantities (3.1), (3.2) and (3.3) by corresponding expressions based on the
finite population. Hence, proceeding as in Cochran (1977), the infinite population parameters given
by expressions (3.1), (3.2) and (3.3) can be replaced by their finite population counterparts. That is:

$$ W_h = N_h / N, \qquad (3.12) $$

$$ \bar{Y}_h = \frac{1}{N_h} \sum_{j = b_{(h-1)}+1}^{b_{(h)}} y_{(j)}, \qquad (3.13) $$

$$ S_h^2 = \frac{1}{N_h - 1} \left[ \sum_{j = b_{(h-1)}+1}^{b_{(h)}} y_{(j)}^2 - N_h \bar{Y}_h^2 \right], \qquad (3.14) $$

for $h = 1, 2, \ldots, L$.

Using these last quantities, the problem described in Section 2 of finding boundaries $b_{(1)}, \ldots,
b_{(L-1)}$ such that the overall sample size n is minimized for a given level of reliability c and a
specific allocation scheme can easily be solved by the following iterative method:


STEP 0 : Sort the population $y_1, \ldots, y_N$ in ascending order and set $b_{(0)} = y_{(1)}$ and $b_{(L)} = y_{(N)}$.

STEP 1 : Start with some arbitrary boundaries such that $b_{(0)} < b_{(1)} < \ldots < b_{(L-1)} < b_{(L)}$.


STEP 2 : Calculate the proportions $W_h$, the means $\bar{Y}_h$ and the variances $S_h^2$ (from equations
(3.12), (3.13) and (3.14) respectively) based on these boundaries, $h = 1, \ldots, L-1$.


STEP 3 : Replace the initial set of boundaries by $b'_{(1)}, \ldots, b'_{(L-1)}$, where

$$ b'_{(h)} = \frac{-\beta_h + \sqrt{\beta_h^2 - 4 \alpha_h \gamma_h}}{2 \alpha_h}, \qquad h = 1, \ldots, L-1, $$

with $\alpha_h$, $\beta_h$ and $\gamma_h$ evaluated from the finite population quantities.


STEP 4 : Repeat steps 2 and 3 until two consecutive sets are either identical or differ by negligible
quantities, i.e.

$$ \max_{h = 1, \ldots, L-1} \left| b'_{(h)} - b_{(h)} \right| < \epsilon \quad \text{for some } \epsilon > 0. $$

The use of this algorithm with real data will be compared to others in the next section. 
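
The steps above can be sketched as follows for the Y-proportional power allocation. This is a
minimal illustration under stated assumptions rather than the authors' program: the names
`stratum_stats`, `update_boundaries` and `lh_boundaries` are hypothetical, boundaries are treated
as values (a unit belongs to stratum h when b_(h-1) < y <= b_(h)), the data are assumed positive,
and the iteration simply stops and keeps the last valid boundaries when a take-some stratum falls
below two units or a quadratic has no real root.

```python
import math


def stratum_stats(y, bounds):
    """(W_h, mean_h, var_h) per stratum, or None if the stratification is unusable."""
    edges = [-math.inf] + list(bounds) + [math.inf]
    stats = []
    for h in range(len(edges) - 1):
        grp = [v for v in y if edges[h] < v <= edges[h + 1]]
        is_take_all = h == len(edges) - 2
        if len(grp) < (1 if is_take_all else 2):
            return None
        mean = sum(grp) / len(grp)
        var = sum((v - mean) ** 2 for v in grp) / (len(grp) - 1) if len(grp) > 1 else 0.0
        stats.append((len(grp) / len(y), mean, var))
    return stats


def update_boundaries(y, bounds, c, p):
    """One pass of STEP 2 and STEP 3; returns new boundaries or None on failure."""
    stats = stratum_stats(y, bounds)
    if stats is None:
        return None
    n_pop, ybar = len(y), sum(y) / len(y)
    ts = stats[:-1]                                    # take-some strata only
    A = sum((w * m) ** p for w, m, _ in ts)
    B = sum(w * w * s2 * (w * m) ** (-p) for w, m, s2 in ts)
    F = n_pop * c * c * ybar * ybar + sum(w * s2 for w, _, s2 in ts)

    def piece(h):
        """Contribution of stratum h to (alpha, beta, gamma) in (3.8) and (3.9)."""
        w, m, s2 = ts[h]
        K = p * B * (w * m) ** (p - 1) - p * A * w * w * s2 * (w * m) ** (-p - 1)
        T = A * w * (w * m) ** (-p)
        d = F * T - A * B
        return d, F * K - 2 * m * d, F * T * s2 + m * m * d

    new_bounds = []
    for h in range(len(ts)):
        alpha, beta, gamma = piece(h)
        if h + 1 < len(ts):                            # equation (3.8)
            a2, b2, g2 = piece(h + 1)
            alpha, beta, gamma = alpha - a2, beta - b2, gamma - g2
        else:                                          # equation (3.9)
            gamma -= F * F
        disc = beta * beta - 4 * alpha * gamma
        if alpha == 0 or disc < 0:
            return None
        new_bounds.append((-beta + math.sqrt(disc)) / (2 * alpha))
    return new_bounds


def lh_boundaries(y, start, c=0.05, p=1.0, tol=1e-6, max_iter=200):
    """Iterate STEPS 2 to 4 from `start`; returns (boundaries, converged flag)."""
    y = sorted(y)
    bounds = list(start)
    for _ in range(max_iter):
        new = update_boundaries(y, bounds, c, p)
        if new is None or stratum_stats(y, new) is None:
            return bounds, False
        if max(abs(a - b) for a, b in zip(new, bounds)) < tol:
            return new, True
        bounds = new
    return bounds, False
```

A call such as `lh_boundaries(values, start=[q1, q2])` with two starting quantiles gives three
strata. The convergence flag is worth checking, and trying several starting points is a
reasonable precaution since the outcome of an iterative scheme of this kind can depend on
where it starts.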


4. SOME ILLUSTRATIONS 


In order to display results given in Section 3, we will use data obtained from the Annual Retail 
Trade and Wholesale Trade Surveys conducted at Statistics Canada. These surveys measure the 
sales of companies whose principal business is retailing or wholesaling respectively. Three popula- 
tions have been used to illustrate the algorithm. They are, respectively, other products in Wholesale 
in Quebec (Population 1), other foods in Wholesale in Manitoba (Population 2), and appliances, 
television, radio and stereo stores in Retail in Quebec (Population 3). Those populations have been 
chosen to reflect different combinations of population sizes: high, medium and low. The skewness 
for these populations is 24.2 (for Population 1), 6.5 (for Population 2) and 13.6 (for Population 3). 

The numerical results provided by the algorithm will be compared to those obtained using two
other methods. The first method is to simply stratify the population using the cumulative square
root f rule given by Dalenius-Hodges (1959). The second method is to determine the cut-off
boundary between take-all and take-some strata using the approximation given by Hidiroglou (1986)
and then to apply the cumulative square root f rule to stratify the non take-all population into
a number of take-some strata. The different methods will respectively be labelled as i) Cum √f
rule, ii) mixture, and iii) optimum, for the currently proposed algorithm. The sole use of the
Dalenius-Hodges (1959) method is not realistic because it would, in practice, only be used after
the take-all stratum had been identified using some given arbitrary rule. However, we display the
sole use of this method to caution against its blind use in the context of highly skewed populations.
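
For readers unfamiliar with the cumulative square root f rule, a minimal sketch follows. It
assumes the usual textbook form of the rule: group the values into equal-width classes, cumulate
the square roots of the class frequencies, and cut at equal intervals on that cumulative scale.
The function name, the default number of classes and the choice of upper class edges as boundaries
are assumptions of the illustration.

```python
import math


def cum_sqrt_f_boundaries(y, n_strata, n_classes=50):
    """Stratum boundaries from the cumulative square root of frequency rule.

    y must contain at least two distinct values. Boundaries are upper edges of the
    classes in which the cumulative sqrt(f) first reaches k/n_strata of its total.
    """
    lo, hi = min(y), max(y)
    width = (hi - lo) / n_classes
    freq = [0] * n_classes
    for v in y:
        j = min(int((v - lo) / width), n_classes - 1)   # the maximum falls in the last class
        freq[j] += 1

    cum, running = [], 0.0
    for f in freq:
        running += math.sqrt(f)
        cum.append(running)

    bounds, k = [], 1
    for j, value in enumerate(cum):
        while k < n_strata and value >= k * running / n_strata:
            bounds.append(lo + (j + 1) * width)
            k += 1
    return bounds


if __name__ == "__main__":
    print(cum_sqrt_f_boundaries(list(range(100)), n_strata=3, n_classes=10))
```

With a highly skewed population most classes in the upper tail are empty or nearly so, which is
one way to see why the rule alone places the last boundary very high.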
The Hidiroglou (1986) cut-off point is obtained via the following iterative process:

$$ b_r = \bar{Y}_{[N-t']} + \left\{ \frac{N^2 c^2 \bar{Y}^2}{N - t'} + S^2_{[N-t']} \right\}^{1/2}, \qquad (4.1) $$

where

$$ \bar{Y}_{[N-t']} = \frac{1}{N - t'} \sum_{i=1}^{N-t'} y_{(i)} \qquad (4.2) $$
and

$$ S^2_{[N-t']} = \frac{1}{N - t' - 1} \sum_{i=1}^{N-t'} \left( y_{(i)} - \bar{Y}_{[N-t']} \right)^2. $$

The number of take-all units obtained for each step of this iterative process is t'. The starting
point for this approximation is

$$ b_0 = \bar{Y} + \left\{ N c^2 \bar{Y}^2 + S^2_{[N]} \right\}^{1/2}. \qquad (4.3) $$

The stopping point for (4.1) is reached when the following inequality is satisfied:

$$ 0 < 1 - n(t') / n(t) < 0.10, \qquad (4.4) $$

where

$$ n(t) = t + \frac{(N - t)^2 S^2_{[N-t]}}{(N c \bar{Y})^2 + (N - t) S^2_{[N-t]}}. \qquad (4.5) $$


Table 1
Effect of Varying Coefficient of Variation and Power Allocation
on Sample Sizes for Three Stratification Methods
(Population 1: Size = 1221)

                          Cum √f Rule                 Mixture                   Optimum
    c     p Stratum    N_h   n_h        b_(h)    N_h   n_h        b_(h)    N_h   n_h        b_(h)

 0.05  0.25       1   1196  177*                1017   16                  891   11
                  2     20   20     3,715,320    152   14       465,180    290   13       302,912
                  3      5    5    14,786,280     52   52     1,131,961     40   40     1,835,930
              Total         202                        82                        64

 0.05  0.50       1   1196  178*                1017   16                  863   10
                  2     20   20     3,715,320    152   13       465,180    318   14       289,422
                  3      5    5    14,786,280     52   52     1,131,961     40   40     1,832,038
              Total         203                        81                        64

 0.01  1.00       1   1196  616*                 751   37                  687   36
                  2     20   20     3,715,320    215   34       196,840    374   78       162,068
                  3      5    5    14,786,280    255  255       383,033    160  160       564,076
              Total         641                       326                       274

 0.05  1.00       1   1196  180*                1017   16                  858    8
                  2     20   20     3,715,320    152   11       465,180    323   16       271,920
                  3      5    5    14,786,280     52   52     1,131,961     40   40     1,867,254
              Total         205                        79                        64

 0.10  1.00       1   1196   56*                1073    7                 1007    7
                  2     20   20     3,715,320    109    4       592,900    191    9       442,357
                  3      5    5    14,786,280     39   39           [?]     23   23           [?]
              Total          81                        50                        39

*Requires over allocation to satisfy coefficient of variation.


Table 2
Effect of Varying Coefficient of Variation and Power Allocation
on Sample Sizes for Three Stratification Methods
(Population 2: Size = 44)

                          Cum √f Rule                 Mixture                   Optimum
    c     p Stratum    N_h   n_h        b_(h)    N_h   n_h        b_(h)    N_h   n_h        b_(h)

 0.05  0.25       1     42   38                   32    1                   29    1
                  2      1    1*  137,939,900      6    1     4,708,409     11    1     3,029,455
                  3      1    1   459,739,000      6    6    10,622,301      4    4    17,461,464
              Total          40                         8                         6

 0.05  0.50       1     42   38                   32    1                   28    1
                  2      1    1*  137,939,900      6    1     4,708,409     12    1     2,582,819
                  3      1    1   459,739,000      6    6    10,622,301      4    4    17,640,325
              Total          40                         8                         6

 0.01  1.00       1     42   42                   25    1                   25    1
                  2      1    1   137,939,900      5    1     1,059,550     10    4     1,153,322
                  3      1    1   459,739,000     14   14     3,742,377      9    9     5,969,271
              Total          44                        16                        14

 0.05  1.00       1     42   38                   32    1                   26    1
                  2      1    1*  137,939,900      6    1     4,708,409     14    2     1,779,500
                  3      1    1   459,739,000      6    6    10,622,301      4    4    17,349,902
              Total          40                         8                         7

 0.10  1.00       1     42   30                   34    1                   28    1
                  2      1    1*  137,939,900      6    1     4,848,218     13    1     2,413,800
                  3      1    1   459,739,000      4    4    16,749,625      3    3           [?]
              Total          32                         6                         5

*Requires over allocation to satisfy coefficient of variation.


Table 3
Effect of Increasing the Number of Strata on
Sample Sizes for Two Stratification Methods
p = 1, c = 0.05

                                                 Number of Strata
                               3                          4                          5
Method     Stratum    N_h   n_h        b_(h)     N_h   n_h        b_(h)     N_h   n_h        b_(h)

Population 1 (N = 1221)
Mixture          1   1017    16                  897     6                  823     3
                 2    152    11      465,180     194     5      311,117     194     2      245,090
                 3     52    52    1,131,961      78     4      641,252     101     2      465,180
                 4                                52    52    1,131,961      51     2      751,297
                 5                                                           52    52    1,131,961
             Total           79                         67                         61
Optimum          1    858     8                  704     3                  655     2
                 2    323    16      271,920     373     7      173,981     358     4      149,327
                 3     40    40    1,867,254     112     6      604,869     163     5      453,114
                 4                                32    32    2,676,449      29     4    1,522,329
                 5                                                           16    16    5,810,487
             Total           64                         48                         31

Population 3 (N = 161)
Mixture          1    106     6                   84     2                   71     1
                 2     39     6      265,480      38     2      185,320      35     1      155,260
                 3     16    16      553,255      23     2      335,620      22     1      265,480
                 4                                16    16      553,255      17     1      385,720
                 5                                                           16    16      553,255
             Total           28                         22                         20
Optimum          1     86     4                   55     1                   34     1
                 2     65     9      199,415      61     3      125,572      51     1       83,594
                 3     10    10      680,942      39     5      312,769      42   [?]      192,215
                 4                                 6     6      826,942      29     3      382,236
                 5                                                            5     5      906,894
             Total           23                         15                        [?]
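
The sample size n(t) = t + (N - t)^2 S^2 / [(N c Ybar)^2 + (N - t) S^2] for a single take-some
stratum with t take-all units, as in (4.5), is easy to evaluate directly. The sketch below does so
and, instead of the iterative approximation described above, simply searches every admissible t
for the smallest n(t); this exhaustive search is a swapped-in check, not Hidiroglou's procedure,
and the function names are hypothetical.

```python
def sample_size_with_take_all(y, t, c):
    """n(t): the t largest units are taken with certainty, SRS without replacement in the rest."""
    y = sorted(y)
    n_pop = len(y)
    ybar = sum(y) / n_pop
    rest = y[: n_pop - t]                      # take-some units
    m = len(rest)
    mean = sum(rest) / m
    s2 = sum((v - mean) ** 2 for v in rest) / (m - 1)
    return t + m * m * s2 / ((n_pop * c * ybar) ** 2 + m * s2)


def best_take_all_count(y, c):
    """Number of take-all units minimizing n(t), keeping at least two take-some units."""
    candidates = range(0, len(y) - 1)
    return min(candidates, key=lambda t: sample_size_with_take_all(y, t, c))


if __name__ == "__main__":
    values = [1.0, 1.0, 1.0, 1.0, 1000.0]
    t_best = best_take_all_count(values, c=0.05)
    print(t_best, sample_size_with_take_all(values, t_best, c=0.05))
```

For small populations the exhaustive search is cheap and gives a reference against which an
approximate cut-off can be compared.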


Tables 1 and 2 display the results for a large population (Population 1) and a small population 
(Population 2) for a number of different coefficients of variation and power allocations. Table 
3 displays the results for the large population (Population 1) and a medium population (Popula- 
tion 3) by varying the number of strata. For all three tables, the allocation of the sample to the 
take-some strata is the power Y-proportional scheme. 

The following conclusions can be drawn from Tables 1 and 2. The use of the cumulative square
root f rule to determine boundary points is very inefficient in the present context. Substantial gains,
in terms of sample size reduction, are made by using the mixture rule. For the three strata used
in those two tables, further reductions in sample size of the order of 20% can be achieved by using
the optimum rule. For a given fixed coefficient of variation, the variation of the power p has
a minor impact on the resulting sample size. As expected, sample sizes increase when the required
coefficient of variation, c, is decreased (for a fixed power allocation). The optimum method declares
fewer take-all units (stratum 3) than the mixture method or, stated another way, the take-all stratum
boundary is higher for the optimum than for the mixture. The cumulative square root rule loses
its efficiency in the take-all stratum boundary determination. It is readily observed that the
boundary for this method is significantly higher than those obtained with the other methods.

In Table 3, we only compare the mixture and optimum methods for two populations, varying
the number of strata, for a fixed coefficient of variation and Y-proportional power allocation.
Conclusions similar to those drawn from Tables 1 and 2 hold. The effect of increasing the number
of strata is to reduce the number of sampled units for both methods. However, the reduction
becomes more pronounced for the optimum method as the number of strata increases.


5. CONCLUSION 


The optimal stratification of a skewed population into a take-all stratum and a number of
take-some strata has provided a substantial reduction in overall sample size for given relative
precision. The method can be adapted to any type of allocation and to any number of strata. The
take-all condition can also be excluded.

The algorithm, which is recursive in nature, converges quickly. It is simple to implement on 
the computer using SAS, FORTRAN, or any other high-level language.


Research Issues in the
Survey of Income and Program
Participation¹


DANIEL KASPRZYK²


ABSTRACT 


The Survey of Income and Program Participation (SIPP) is an ongoing nationally representative household 
survey program of the Bureau of the Census. The primary purpose of the SIPP is to improve the measure- 
ment of information related to the economic situation of households and persons in the United States. 
It accomplishes this goal through repeated interviews of sample individuals using a short reference period 
and a probing questionnaire. The multi-interview design of the SIPP raises methodological and statistical 
issues of concern to all panel surveys of families and persons. This paper reviews these issues as they relate 
to the SIPP. The topics reviewed are: 1) questionnaire design; 2) data collection, including respondent 
rules, data collection mode, length of reference period, and rules for following movers; 3) concepts, design, 
and estimation; and 4) response error. 


KEY WORDS: Panel surveys; Questionnaire design; Survey design; Longitudinal estimation; Response 
error. 


1. INTRODUCTION 


The Survey of Income and Program Participation (SIPP) is an ongoing nationally represen- 
tative household survey program of the U.S. Bureau of the Census. It provides comprehen- 
sive information on the economic resources of the American people and on how public transfer 
and tax programs affect their financial circumstances. The data from the SIPP provide govern- 
ment policy makers with an information base for studying the efficiency of government tax 
and transfer programs, for estimating future program costs and coverage, and for assessing 
the effects of proposed policy changes. The SIPP is designed to improve the measurement of 
information related to the economic situation of households and persons in the United States, 
and is the culmination of a large-scale development program, the Income Survey Development 
Program (ISDP), which examined concepts, procedures, questionnaires, recall periods, and 
the like (Ycas and Lininger, 1981). 

The need for a survey like SIPP arose because of the limitations of the March Income
Supplement of the Current Population Survey (CPS), the principal source of information on the
distribution of household and personal income in the United States. These limitations are
inherent in the survey design, survey instrument, and survey procedures and cannot be easily
modified. As a consequence the Income Survey Development Program was established in 1975
by the U.S. Department of Health and Human Services to develop methods to overcome the
principal shortcomings of the CPS: (1) the underreporting of property income and other
irregular sources of income; (2) the underreporting and misclassification of participation in
major income security programs and other types of information that people generally find
difficult to report accurately (for example, monthly detail on income earned during the year);
and (3) the lack of information necessary to analyze program participation and eligibility.
Several features distinguish field tests of the ISDP from other data collections, particularly
the CPS. They include: (1) interviews for the same persons were obtained at regular intervals
within a year; (2) most types of income were reported on a monthly basis; (3) income was
reported on an individual basis; (4) individuals were followed over the survey period to obtain
data on changes in income and family composition; and (5) information was collected on special
topics such as disability, child care, fertility, net worth, and taxes paid to provide insight into
the context of program benefits, program dependency, and overall economic well-being.
Because the ISDP was the predecessor to SIPP, many characteristics of the ISDP can be seen
in the SIPP, including the survey design, content, and questionnaire format.


¹ This paper reports the general results of the research undertaken by Census Bureau staff. The views expressed are
attributable to the author and do not necessarily reflect those of the Census Bureau.
² Daniel Kasprzyk, SIPP Research and Coordination Staff, United States Bureau of the Census, Washington D.C. 20233

The SIPP began in October 1983 as an ongoing survey program with one sample panel of 
21,000 households selected to represent the noninstitutional population of the United States. 
Each household is interviewed once every four months for approximately 2 2/3 years; the
reference period for the principal survey items is the 4 months preceding the interview. This 
interviewing plan results in eight interviews per household. Each year a new panel is introduced. 
This design allows cross-sectional estimates to be produced from the combined sample of 2 
panels. Information concerning the SIPP design, content, and operations can be found in 
Nelson, McMillen and Kasprzyk (1985). 

This paper reviews specific methodological, survey design, and statistical issues of concern
to the program. The general categories of interest are: (1) questionnaire design; (2) data
collection, including respondent rules, data collection mode, length of reference period, and rules
for following movers; (3) concepts, design, and estimation; and (4) response error.


2. QUESTIONNAIRE DESIGN 


The principal effort of the ISDP was directed to overcoming problems which resulted in
underreporting and misclassification of income in the CPS March Supplement. In an ISDP
field test, two questionnaire approaches were developed. For simplicity, one version may be
referred to as the "short" form and the other as the "long" form. The short form approach
attempted to gather income data directly while keeping respondent burden at a moderately
low level. For each household member, questions were asked directly about the receipt of
certain income types. If income was received, the amount received during the reference period
was determined before proceeding to the next source of income.

The general strategy of the long form approach was to isolate events, experiences, and other
attributes associated with the receipt of specific types of income. This form contained an
extensive set of probes about the receipt of income and lengthy questions to ascertain income
amounts. Amounts associated with specific income types were not obtained until all sources
of income were determined.

The hypothesis tested was that the long form approach produces more complete and 
accurate reporting of income; Olson (1980) provides a summary of the analysis conducted on 
the two questionnaire formats. Several approaches to the analysis were implemented and are 
discussed in Olson’s summary: (1) staff observation of training and interviewing; (2) debriefing
sessions of interviewers and observers; (3) case-by-case reviews of completed questionnaires;
(4) analysis of survey and item response rates; and (5) data analyses focussing on the quality
of the data collected, and questionnaire edit failures, especially those associated with the
inability of the interviewer to follow questionnaire skip patterns. The form adopted for further
research and ultimately the SIPP was a variation of the long form. The long form was perceived
by both interviewers and respondents as less burdensome and also was shown to have higher
income reporting rates.


An experiment with questionnaire formats was also included in the ISDP; this experiment 
contrasted a household screening format with a person-based format which had evolved from 
prior ISDP field tests. The household screening approach was based on a revised version of 
the questionnaire used in the April 1978 CPS Income Supplement Test and was intended to 
reduce burden by asking a single household respondent whether anyone in the household 
received a particular kind of income during the reference period. Each affirmative response 
was followed by a question to identify exactly which household member(s) received that type 
of income. Complete recipiency for all household members was recorded before asking about 
amounts of income received by specific individuals. This approach was expected to reduce inter- 
view time without reducing data quality. 

The approach above was contrasted with a person-based approach. Under this approach, 
questions on all sources of income were asked of the first household member, then repeated 
for the second, and so on. A separate form was filled out for each adult in a sample household,
but extensive use was made of skip instructions and check items to reduce the number of ques- 
tions asked of any one respondent. 

Differences in the quality of the data obtained with the two questionnaire formats and dif- 
ferences in the interview times appeared slight. Large differences were not observed between 
the two approaches in estimates of income recipiency rates, and in the incidence of ‘‘don’t
know’’ and ‘‘refusals.’’ Interview time, expected to be significantly less under the household
questionnaire approach, was about five minutes less per household and about three minutes 
less per person than the person approach. Since the household screening format did not offer 
a significant improvement over the person-based approach, this person-based format, with 
modest improvements and refinements, was adopted for SIPP. 

Questionnaire design issues and discussions concerning data collection procedures continue 
to be part of the SIPP program. The general issue is whether interviews conducted without 
the use of responses from previous interviews (the so-called independent approach) produce 
better estimates than interviews conducted using the previous interview responses to remind 
respondents of earlier statuses (the so-called dependent interview approach). In the SIPP, a 
dependent approach is used to update income receipt patterns at each interview, but the 
approach has not been systematically evaluated. 

A similar dependent approach to data collection is also possible with the data collected in
the SIPP on personal net worth. These data are obtained at two points in time, one year apart.
To examine differences between the dependent and independent approaches, one half of the
sample in the Wave 7 interview was provided information on asset and liability values collected
in Wave 4 of the 1984 Panel, while the other half was not provided the previously reported
information.

The rationale for this dependent or ‘‘feedback’’ approach was that respondents would pro- 
vide more accurate estimates of change if they were first reminded of the amount they reported 
the previous year. If respondents knew the amount of the change in asset values and were
reminded of their beginning balance, then presumably their reporting of the current balance 
would be consistent with the true amount of change over the period. Lamas and McNeil (1987) 
analyze these data, but give no definite answer about the impact of the feedback approach since 
benchmark data are not available. They do, however, say that the dependent interview did not 
affect cross-sectional estimates and that the approach produced results consistent with expected 
differentials in net worth across subgroups. They also looked at micro-level changes in net worth 
using only households with fully reported wealth data and found some evidence that the
dependent interview reduced the estimates of the change in net worth.

The same questionnaire design issue, the dependent versus independent interview, has also 
occurred in the repeated measurement of industry and occupation. During the 1984 and 1985 
SIPP panels these data were collected independently during each interview even though the 
individual had not changed employers. This procedure acknowledges the fact that an employee’s 
duties may change from time to time and allows these changes to be recorded. Sufficient change 
in duties can result in a change in the person’s occupation classification from interview to inter- 
view even though the employer has not changed. 

The independent collection of industry and occupation data has, however, several prob- 
lems. Undue variation in occupation classification can result when respondent descriptions 
of duties vary slightly or when the interpretation of the written description varies between the 
clerical staff members assigning the classification codes. 

Research into this problem has provided some estimates of the number of times occupa- 
tion and industry classifications change from interview to interview for persons with the same 
employer. Among individuals who reported the same employer during the first 12 months of 
the 1984 SIPP Panel, approximately 40 percent of these persons changed 3-digit occupation 
codes between two consecutive interviews and 20 percent changed 3-digit industry codes (Kalton, 
McMillen and Kasprzyk, 1986). 

As a result, a modification was made in the 1986 SIPP Panel to reduce changes in occupation
and industry codes resulting from random response error and clerical interpretation, and
to reduce interview time. The modification introduces a ‘‘screener’’ question that asks if 
activities or duties have changed during the past 8 months. A negative response eliminates the 
detailed occupation and industry questions. The occupation and industry classifications are 
then derived from responses given in the previous interview. 
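
The screener logic just described can be sketched in a few lines. This is an illustration only: the function, class, and code values below are hypothetical and are not taken from the SIPP instrument.

```python
from typing import Callable, NamedTuple


class JobCodes(NamedTuple):
    """Hypothetical container for 3-digit classification codes."""
    occupation: str
    industry: str


def codes_for_wave(duties_changed: bool,
                   previous: JobCodes,
                   ask_detailed_questions: Callable[[], JobCodes]) -> JobCodes:
    """Return this wave's codes under a dependent-interview screener.

    A negative screener response skips the detailed questions and carries
    the previous interview's codes forward; a positive response triggers
    the detailed questions (and, in practice, fresh clerical coding).
    """
    if not duties_changed:
        return previous
    return ask_detailed_questions()


previous = JobCodes(occupation="123", industry="456")  # made-up codes
unchanged = codes_for_wave(False, previous, lambda: JobCodes("999", "456"))
changed = codes_for_wave(True, previous, lambda: JobCodes("999", "456"))
```

Because the codes are carried forward whenever the screener response is negative, spurious code changes caused by wording or clerical variation cannot occur for those respondents.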

It is important to note that while this change was made for the 1986 SIPP Panel, industry 
and occupation data from the 1985 SIPP Panel, collected during the same time period, were 
still collected independently each wave, giving rise to a natural experiment embedded in the 
two panels. These data have not yet been analyzed. 


3. DATA COLLECTION 


Four topics affecting data collection in the SIPP are discussed below: (1) respondent rules; 
(2) data collection mode; (3) length of reference period; and (4) rules for following movers. 


Respondent Rules 


When interviewing households with more than one member, a problem which must be 
addressed is the extent to which proxy responses are acceptable. Since not everyone may be 
present at the time of the interview, both time and money can be saved by asking another 
household member about persons who are not present. The difficulty with this is that along 
some dimensions of the survey instrument, the proxy report may result in less accurate data 
than the self-report. Kalton, Kasprzyk and McMillen (1988) provide a discussion of this issue 
in the context of panel surveys. 

A formal test of respondent rules, conducted in the ISDP, compared the quality of reporting
in a treatment group where proxy interviews were accepted from any household member who
felt qualified to answer for a missing person with a treatment group where proxy interviews
were not permitted except in extreme situations (respondent physically or mentally incapable,
unable to speak English, away from the household during the entire interviewing period, etc.).
About 85 percent of adults interviewed in the self-response rule households were
self-respondents and about 65 percent were self-respondents in the usual or proxy response rule
households. Thus, the implementation of the self-response rule raised the share of
self-interviews by approximately 20 percentage points relative to the other treatment (Coder 1980).

Refusal rates were slightly higher for the self-response treatment and the percent of 
households interviewed was slightly higher for the proxy response treatment. The differences, 
however, were too small to give insight into which rule should be preferred. Person noninter- 
view rates in households where at least one other adult was interviewed were higher under self- 
response rules than under usual response rules. Differences between treatment groups in 
reported income recipiency rates also appeared to be small and unaffected by the response rule, 
and combined ‘‘don’t know’’ and ‘‘refusal’’ rates for income amounts of various income types
were not consistently lower under the self-response mode. 

Under the self-response rules, records were used more often by persons when answering wages 
and salary questions, and response rates for hourly wage rates were higher, but in general the 
evidence for either set of response rules was not conclusive. Thus, as a result of these findings, 
estimated costs for using a self-response rule (4-6 percent higher than the proxy rule), and the 
implementation of a ‘‘call back’’ procedure to obtain certain critical information unavailable 
at the time of the interview, the SIPP respondent rules now allow proxy interviews to be taken. 

A related problem is the response rule for college students. Students are usually considered 
members of their parents’ households until they establish a permanent residence elsewhere. 
Thus, the usual procedure for students living away from home while attending school is to treat 
them as household members who are temporarily absent and obtain proxy interviews from 
other members of their parents’ household. In order to measure the accuracy of information 
taken from proxy interviews for students living away from home, one interview during an ISDP 
field test was first obtained by proxy at the parents’ household and then by self-interview at 
the student’s school residence. The results of this study are described by Roman and O’Brien 
(1984). The analysis presented is limited due to flaws in the administration and implementa- 
tion of the test. The authors observed, however, that quite often a proxy cannot identify a par- 
ticular source of student income and even if they can identify it, they are more likely to respond 
‘‘don’t know’’ to the particulars about that source. They also noted that the larger the income
or expense, the better the proxy response becomes. 


Data Collection Mode 


The SIPP has conducted most interviews (approximately 95 percent) face to face (Kalton, 
McMillen, and Kasprzyk, 1986). Because of the rising costs of face to face interviews, the
Census Bureau is considering the possibility of conducting a substantially larger number of 
SIPP interviews by telephone. 

As a result, a SIPP telephone interview pretest was conducted in June 1985 to assess the 
feasibility of ‘‘warm’’ telephone interviewing for SIPP — that is, telephone interviews for 
households which received a face to face interview at an earlier wave. The pretest was con- 
ducted in 2 of the Census Bureau’s Regional Offices with a sample of 280 households. Refusal 
rates (about 2.5%) and noncontact rates (about 11%) were within staff’s expectations. Item
nonresponse rates were not unexpectedly high (U.S. Bureau of the Census 1986).

Following this, a SIPP National Telephone Test took place from August to November 1986 
and February to April 1987; the purpose of the test was to study the large-scale use of warm 
telephoning in SIPP and to learn whether people are willing to furnish data by telephone for 
2 interviews in a row. Households within 50 percent of the segments were designated as maximum
telephone interview cases; the remaining 50 percent were maximum personal visit cases.
Interviewers conducted almost all of the telephone interviews from their homes. Gbur and 
Durant (1987) report preliminary results from the first phase of the experiment. They indicate 
that household response rates did not seem to be seriously affected by the use of the telephone 
and person nonresponse rates were comparable by mode. Item nonresponse rates were only 
slightly affected by telephone interviewing. Additional results are forthcoming. 


Length of Reference Period 


The ISDP focussed on data collection techniques designed to improve the reporting of cash 
and noncash income, and as such the length of the reference period for most survey items was 
an important design decision. 

This issue was addressed twice during the ISDP. First a single interview using a six month 
recall period was compared with two consecutive interviews, both using 3-month reference 
periods. Second, an experiment was conducted comparing reported property income amounts 
using a 3-month recall versus those with a 6-month recall period. 

Olson (1980) describes some analyses conducted on the first experiment. Not surprisingly, 
using a 6 month recall period understates the proportion of income reported in earlier periods. 
This pattern held for a number of specific sources of income such as wages, Aid to Families 
with Dependent Children, and unemployment compensation. These findings, though not
definitive, support the presumption that longer recall periods increase chances of omission due
to memory loss. Other analysis showed that the number of sources of income reported per 
household in the first three months of the six month reference period was lower than for the 
corresponding time using a three month reference period. Analyses of the second experiment 
were not conducted due to the withdrawal of funding for the development program. 

The results of the first experiment along with the additional ISDP experience led to a four 
month recall period for the SIPP; this decision maintains cost at the appropriate budget level 
while trying to maintain satisfactory data quality. 


Rules for Following Movers 


An important design feature in the ISDP and now the SIPP is that all persons in a sample 
household at the time of the first interview remain in sample during the 2½-year period of
the panel; this rule holds even if one or more persons should move to a new address. For cost 
and operational reasons, face to face interviews are conducted at new addresses that satisfy 
some geographic constraint — in the ISDP, the address had to lie within 50 miles of an ISDP 
primary sampling unit, while in SIPP, the address must lie within 100 miles of a SIPP primary 
sampling unit. 

For each panel a sample of addresses is selected and individuals are identified at these 
addresses at the time of the first interview. After the first interview, the sample is no longer 
address-based but rather person-based, consisting of all individuals enumerated during the first 
interview. These people and anyone with whom they share living quarters are interviewed in 
subsequent interviews. 

During the ISDP two issues concerning movers were important: (1) the production of cross- 
sectional point in time estimates at each interview; and (2) the costs associated with following 
movers. Huang (1984) presents several unbiased base weights for cross-sectional estimates of 
the noninstitutionalized population when the sample contains movers. He associates obser- 
vations at any given point in time with the known inclusion probabilities of the original sample 
households. Two approaches are described: (1) a multiplicity approach, which depends on the
number of ways that a new household can be included in the sample; and (2) a ‘‘fair share’’
approach which assumes all household members contribute equally to their household. The 
SIPP as well as the ISDP adopted the ‘‘fair share’’ approach. 
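
One simple way to formalize a fair share rule is sketched below. The sketch rests on assumptions that are not spelled out above: each original sample member carries the base weight of his or her first-interview household, persons who joined the household after the first interview carry a base weight of zero, and the household weight at a later interview is the unweighted average over all current members. It is a simplification for illustration and should not be read as the exact formulation given by Huang (1984).

```python
from typing import Sequence


def fair_share_household_weight(member_base_weights: Sequence[float]) -> float:
    """Average the base weights of all current household members.

    Original sample members contribute their first-interview base weight;
    members who joined later are assumed to contribute zero. Every member
    counts equally in the denominator.
    """
    if len(member_base_weights) == 0:
        raise ValueError("household must have at least one member")
    if any(w < 0 for w in member_base_weights):
        raise ValueError("base weights must be non-negative")
    return sum(member_base_weights) / len(member_base_weights)


# Two original sample members (base weight 1,500 each) now share living
# quarters with one person who was not in the original sample.
weight = fair_share_household_weight([1500.0, 1500.0, 0.0])
```

In this example the household weight is 1,000, lower than the original base weight, because the person who joined the household carries no base weight but is still counted as a member.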

The issue of costs was addressed by a ‘‘Mover’s Cost Study’’. This study was to shed some 
light on the data collection costs resulting from following movers to their new addresses. White 
and Huang (1982) describe the study and provide some results based on the movers procedures 
adopted for the field test. They found that the number of eligible households for interview 
increased by 8.8 percent as a result of following movers during a one year time period; they 
also found that movers represented about 22 percent of the total sample after 15 months, and 
that during this period of time the number of interviewing hours increased by 7 percent and 
the number of miles charged by interviewers increased by 11.4 percent. 

Jean and McArthur (1984) discuss data collection issues in the SIPP as they pertain to movers 
and offer recommendations to improve coverage in future SIPP panels. Kalton and Lepkowski 
(1985) also discuss the procedures for following movers in SIPP, and propose a research 
program aimed at measuring the extent of noncoverage from various sources and its concen- 
tration in particular subgroups. More recently, Jean and McArthur (1987), considering five 
waves of SIPP data, report that among persons who moved sometime after the first interview 
(that is, between Waves 2 and 5), 69 percent completed all 5 interviews, 23 percent did not com- 
plete the fifth interview, and 9 percent were interviewed in the fifth wave but were missing at 
least one intervening interview. 


4. CONCEPTS, DESIGN AND ESTIMATION 


During the ISDP and continuing with the SIPP program, significant research activity has 
taken place in the area of conceptualizing annual units of analysis using subannual data, and 
the statistical estimation of these concepts. The treatment of nonresponse in panel surveys has 
also been a topic of considerable research interest. Finally, estimation techniques to reduce 
sampling error and methods to sample subgroups have also been under study in the ISDP and 
SIPP programs. 


Longitudinal Concepts 


Annual family and household statistics are important indicators of the Nation’s economic 
well-being. The SIPP collects subannual data, indeed monthly data, reflecting changes in the 
composition of households; these data allow the development of annual household statistics 
which reflect actual household composition experiences during the year, unlike current 
household statistics which simply ignore intrayear changes in household composition. The con- 
struction of annual units of analysis, whether they are households, families, or program units, 
raises methodological issues concerning longitudinal weights and imputation techniques. The 
main issue is, however, conceptual. Given intrayear composition change, when is it appropriate 
for annual measures to recognize change in household composition and when is it not? Put 
another way, how should households and families be defined so that they account for survey
measurements at two or more points in time without creating serious conflicts with
the traditional cross-sectional household and family constructs?

Analysts at the Census Bureau have given considerable thought to the question of defining 
households and families over time (McMillen and Herriot 1985; Citro 1985). Empirical research 
to examine several definitions of longitudinal households and measures of annual income status 
and family type has been reported by Citro, Hernandez and Herriot (1986) and Citro, Hernandez
and Moorman (1986). The empirical research emphasized four alternative concepts:
(1) a household is the same over time if it has the same reference person; (2) a household is the
same over time if it has the same principal person (this definition differs from the first in its
treatment of married couple households for which the reference person may be either the husband
or wife, but the principal person is always the wife); (3) a household is the same over
time if it has the same reference person and is the same family type over time; and (4) a 
household continues over time if it has the same reference person, is the same family type, and 
has the same membership size. 
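
The four concepts can be compared side by side with a small sketch. The record layout, field names, and example values below are hypothetical; only the four continuity rules come from the definitions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HouseholdSnapshot:
    """Hypothetical summary of one household at one point in time."""
    reference_person: str
    principal_person: str
    family_type: str
    size: int


def same_household(a: HouseholdSnapshot, b: HouseholdSnapshot,
                   definition: int) -> bool:
    """Apply one of the four longitudinal household definitions (1-4)."""
    same_ref = a.reference_person == b.reference_person
    same_type = a.family_type == b.family_type
    if definition == 1:
        return same_ref
    if definition == 2:
        return a.principal_person == b.principal_person
    if definition == 3:
        return same_ref and same_type
    if definition == 4:
        return same_ref and same_type and a.size == b.size
    raise ValueError("definition must be 1, 2, 3, or 4")


# A married couple household in which the reported reference person switches
# from husband (P1) to wife (P2) and a child is added between two interviews.
first = HouseholdSnapshot("P1", "P2", "married-couple", 2)
later = HouseholdSnapshot("P2", "P2", "married-couple", 3)
results = [same_household(first, later, d) for d in (1, 2, 3, 4)]
```

In this constructed case only the principal person definition treats the household as continuing, which illustrates how the second definition differs from the first for married couple households.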

This research has provided preliminary indications that the choice of definition does not 
appreciably affect annual measures of low income status or of households by type. If this finding 
does not change after additional research, considerations, such as ease of implementation and 
operational simplicity, will be the determining factors in the use of a longitudinal household 
definition. 


Statistical Estimation for Longitudinal Concepts 


Research on estimation for longitudinal concepts has proceeded along two paths — 
longitudinal person estimation and longitudinal household (family or program unit) estima- 
tion. The work on person estimation includes the calculation of selection probabilities to yield 
unbiased longitudinal estimates of individual characteristics and the use of controls in additional
stages of estimation (Judkins et al., 1984). A refinement of this work and a description
of the method proposed to produce longitudinal weights for person analysis covering the first 
three SIPP interviews has been reported by Kobilarcik and Singh (1986). 

Kobilarcik and Singh define the longitudinal universe as the noninstitutional population 
(excluding military barracks) on December 1, 1983, the midpoint of the Wave 1 interview 
months. The sample from the longitudinal universe consists of eligible persons living in the 
selected living quarters at the time of the first interview. ‘‘Interviewed’’ persons for purposes 
of this estimation procedure are those who responded to each of the first three SIPP inter- 
views, and who during the first interview lived in a household in which all eligible members 
responded to the interview, and those who resided in a Wave 1 interviewed household, but 
during the second or third interview died or moved outside the geographic boundaries of the 
survey. 

Thus, noninterviewed persons in the estimation procedure are those who at the time of the 
first interview lived in a household in which at least one household member failed to respond 
to the first interview, and those who resided in a Wave 1 interviewed household but failed to 
respond at the second and/or third interview. All persons classified as interviewed are assigned
positive weights. Weights for this universe are derived in the usual way, using the reciprocal 
of the probability of selection, calculating an adjustment for noninterviews, and adjusting to 
demographic population controls. The nonresponse adjustment has two phases, an adjustment 
first for household nonresponse and then for person nonresponse, the latter using informa- 
tion collected during the first interview. 
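
The sequence of weighting steps can be illustrated with a minimal sketch. The function names, the single adjustment cell, and all numerical values are assumptions made for the example; the actual cells and factors used by Kobilarcik and Singh (1986) are not reproduced here.

```python
from typing import Sequence


def nonresponse_factor(weights_eligible: Sequence[float],
                       weights_respondents: Sequence[float]) -> float:
    """Weighted eligible total divided by weighted respondent total in a cell."""
    respondent_total = sum(weights_respondents)
    if respondent_total <= 0:
        raise ValueError("cell has no responding weight")
    return sum(weights_eligible) / respondent_total


def longitudinal_person_weight(selection_probability: float,
                               household_nr_factor: float,
                               person_nr_factor: float,
                               control_factor: float) -> float:
    """Base weight times the two nonresponse factors and the control factor."""
    if not 0.0 < selection_probability <= 1.0:
        raise ValueError("selection probability must be in (0, 1]")
    base_weight = 1.0 / selection_probability
    return base_weight * household_nr_factor * person_nr_factor * control_factor


# Phase 1: 10 eligible households in a cell, 8 of which responded at Wave 1.
hh_factor = nonresponse_factor([2000.0] * 10, [2000.0] * 8)   # 1.25
# Phase 2 and the population control factor are simply assumed here.
final_weight = longitudinal_person_weight(1 / 2000, hh_factor, 1.10, 0.96)
```

With these made-up inputs the base weight of 2,000 is inflated to 2,640. Persons classified as noninterviewed receive no weight, and their share is carried by the interviewed persons in the same adjustment cell.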

The topic of longitudinal household (family or program unit) estimation is also under study. 
Several approaches to this issue were reported by Ernst, Hubble and Judkins (1984) and more 
recently by Ernst (1988). The latter work describes why weighting by the reciprocal of the pro- 
bability of selection does not, in general, work for longitudinal household and family estimates, 
and presents a class of weighting procedures which can accomplish this task. He, furthermore, 
describes the difficulties that can arise in applying these weighting procedures because the infor- 
mation necessary to create the weight may not be available. Ernst also presents conditions which, 
if satisfied by the longitudinal concept, are sufficient for there to exist a weighting procedure
that avoids these problems. Finally, he discusses procedures for adjusting longitudinal con- 
cepts for nonresponse and for controlling demographic variables to independent estimates. 


Nonresponse and Imputation 


For longitudinal surveys such as those of the ISDP and the SIPP, the problems of refusal 
and selective nonresponse are compounded by cumulative losses in responses over the course 
of the panel. Therefore, an important aspect of both the ISDP and SIPP work has been the 
study of methods for compensating for nonresponse. To that end, Kalton (1983) reviewed pro- 
cedures used in survey research. Imputation procedures were also discussed by Kalton and 
Kasprzyk (1982, 1986), where bias and variance properties for several classes of procedures 
are summarized. 

SIPP data can be treated as both cross-sectional and longitudinal. Procedures to compen- 
sate for unit nonresponse in the SIPP as well as other Census Bureau surveys are described 
in Chapman, Bailey and Kasprzyk (1986). Complications arising in the treatment of unit 
nonresponse in a multi-interview survey are described. In a panel survey, however, nonresponse
may also occur as item nonresponse, where a unit takes part in the survey but does not
provide answers to all items, and as wave nonresponse, where a unit provides data for some, but
not all, of the interviews.

Heeringa and Lepkowski (1986) describe general classes of longitudinal imputation methods 
which might be considered as an alternative to a cross-sectional hot deck imputation approach. 
They also empirically compare a simple longitudinal imputation method, longitudinal direct 
substitution, where the value of a nonmissing item is substituted from one time period to another 
when the same item is missing, with a cross-sectional hot deck scheme. Not surprisingly, they 
demonstrate that the direct substitution method for longitudinal imputation understates change. 
They concluded, however, that this may be preferable to the gross overstatement of change 
resulting from the use of the cross-sectional hot deck method. 
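
The contrast between the two methods can be seen in a toy example. The sketch below is a simplification under stated assumptions: the data are invented, the hot deck uses a single imputation class with random donor selection, and neither function is the procedure actually evaluated by Heeringa and Lepkowski (1986).

```python
import random
from typing import List, Optional, Sequence, Tuple

Values = Sequence[Optional[float]]


def direct_substitution(wave_a: Values, wave_b: Values
                        ) -> Tuple[List[Optional[float]], List[Optional[float]]]:
    """Fill a missing item with the same unit's value from the other wave."""
    filled_a, filled_b = [], []
    for a, b in zip(wave_a, wave_b):
        filled_a.append(b if a is None else a)
        filled_b.append(a if b is None else b)
    return filled_a, filled_b


def hot_deck(values: Values, rng: random.Random) -> List[float]:
    """Fill each missing item with a randomly chosen same-wave donor value."""
    donors = [v for v in values if v is not None]
    if not donors:
        raise ValueError("no donors available")
    return [rng.choice(donors) if v is None else v for v in values]


wave1 = [100.0, 250.0, 400.0, 80.0]
wave2 = [110.0, None, 390.0, None]          # units 2 and 4 are missing

_, wave2_direct = direct_substitution(wave1, wave2)
wave2_hot_deck = hot_deck(wave2, random.Random(0))

missing = [i for i, v in enumerate(wave2) if v is None]
change_direct = [wave2_direct[i] - wave1[i] for i in missing]
change_hot_deck = [wave2_hot_deck[i] - wave1[i] for i in missing]
```

For the imputed units, direct substitution yields a change of exactly zero, whereas the hot deck value comes from an unrelated donor and so produces a change that can be large relative to the unit's own level.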

Panel surveys have an additional type of missing data problem called wave nonresponse. 
The amount of missing data for an individual with wave nonresponse is typically greater than 
that encountered for records with item nonresponse. Data available from completed waves 
of interviewing provide more detailed information about the nonresponding unit than is 
available for total nonrespondents. Thus, nonresponse compensation strategies may include 
weighting, imputation, or a combination of both. Kalton, Lepkowski and Lin (1985) discuss 
this issue and empirical findings in the context of the ISDP. This work made it clear that the 
choice between weighting and imputation for missing data of this type is far from obvious. 
Kalton (1986) and Kalton and Miller (1986) further refine the understanding of this problem 
and conclude that imputation can distort some forms of estimates and that weighting may be 
the preferred solution for large subclasses when the reduction in effective sample size is tolerable. 
They caution, however, that imputation may be better for estimates based on small subclasses 
when the loss of sample is important. In the case of a three interview longitudinal SIPP file 
the difference in sample size between weighting and imputation is not substantial, and conse- 
quently the weighting approach is the safer general purpose solution. Finally, Lepkowski (1988) 
after further empirical research concludes that a specific strategy for wave nonresponse can 
only be developed after consideration of such factors as the major survey design objectives, 
the panel design, and the distribution of wave nonresponse patterns. He provides criteria to 
be considered in developing missing data strategies and concludes that weighting strategies 
appear to be preferable for compensating for wave nonresponse. 


Sampling Error Reduction through Estimation Techniques


Two methods for reducing sampling error through estimation techniques are under study: 
composite estimation and the use of administrative records in SIPP estimation. 

Composite estimation is a technique that combines estimates from the current and previous 
time periods with the goal of improving the precision of survey estimates by taking advantage 
of the correlations between responses for the same analytic units at different time periods. Com- 
posite estimation is particularly effective when the correlations are high, which is likely to be 
the case for many important data items in the SIPP. Chakrabarty (1986) has conducted a 
preliminary review of the types of composite estimates appropriate for the SIPP data struc- 
ture. The content of the survey has not been sufficiently stable during the first few years of 
the SIPP to seriously consider adoption of a composite estimator. 
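
As an illustration of the general idea, the sketch below implements one familiar form of composite estimator, in which the current estimate is averaged with the previous composite estimate carried forward by an estimate of change. This form, the parameter name k, and the numbers are assumptions chosen for the example; they are not the estimators reviewed by Chakrabarty (1986).

```python
def k_composite(current_estimate: float,
                previous_composite: float,
                change_estimate: float,
                k: float) -> float:
    """Combine the current estimate with the updated previous composite.

    change_estimate is the estimated change between the two periods,
    typically computed from units observed at both times, where the
    correlation between responses makes it relatively precise.
    """
    if not 0.0 <= k < 1.0:
        raise ValueError("k must lie in [0, 1)")
    return (1.0 - k) * current_estimate + k * (previous_composite + change_estimate)


# Current direct estimate 100, previous composite 90, estimated change +8.
estimate = k_composite(100.0, 90.0, 8.0, 0.5)
```

With k = 0 the composite reduces to the current direct estimate; larger values of k place more weight on the previous period, which is attractive only when the correlation over time is high.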

Another approach to variance reduction is through the use of administrative records for 
post-stratification. Currently, cross-section estimation procedures for SIPP make use of a 
second-stage adjustment to increase the precision of estimates by ratio adjusting collection 
month and reference month estimates to population estimates. However, the Census Bureau 
has access to some Internal Revenue Service and Social Security Administration files which 
can be used to produce detailed age, race, and sex distributions by adjusted gross income. The 
issue, which we have just begun to explore, is how these administrative data can be used for 
post-stratification to improve estimates of mean and median personal and household income 
as well as the estimates of the deciles of the personal and household income distribution. The 
basic question under study is the magnitude of the reduction in variances of these estimates 
achieved through such a procedure. Fay and Huggins (1988) will provide some indications. 
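
The mechanics of such a post-stratification adjustment can be sketched as follows. The cell labels and control totals are invented for the example; in the application discussed above the cells would be formed from age, race, sex, and adjusted gross income classes derived from the administrative files.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def poststratify(weights: Sequence[float],
                 cells: Sequence[Hashable],
                 control_totals: Dict[Hashable, float]) -> List[float]:
    """Ratio-adjust weights so each cell's weighted count equals its control."""
    if len(weights) != len(cells):
        raise ValueError("weights and cells must have the same length")
    weighted_count: Dict[Hashable, float] = defaultdict(float)
    for w, c in zip(weights, cells):
        weighted_count[c] += w
    adjusted = []
    for w, c in zip(weights, cells):
        if c not in control_totals:
            raise KeyError(f"no control total for cell {c!r}")
        if weighted_count[c] <= 0:
            raise ValueError(f"cell {c!r} has no sample weight")
        adjusted.append(w * control_totals[c] / weighted_count[c])
    return adjusted


weights = [100.0, 100.0, 200.0, 200.0]
cells = ["low AGI", "low AGI", "high AGI", "high AGI"]
controls = {"low AGI": 300.0, "high AGI": 360.0}   # hypothetical controls
adjusted = poststratify(weights, cells, controls)
```

After adjustment the weighted counts match the controls exactly in every cell. Any variance reduction for income estimates depends on how strongly the cell variable is related to the income measure being estimated.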


Sampling for Special Subpopulations 


Subgroups of the population are often cited as being more affected by governmental policy
than others, for example persons in poverty, the aged, Blacks, Hispanics, and
participants in Federal income security programs. Early design goals of the ISDP emphasized a
concern for improving the reliability of subpopulation estimates. This was exhibited in the 
emphasis placed in the ISDP on sampling from administrative program lists. Thus, samples 
were often drawn from lists of current participants in Federal or state administered
programs (Kasprzyk 1983; Bowie and Kasprzyk 1987).

A Census Bureau Working Group analyzed subsampling (screening) proposals for over- 
sampling special populations. The issue studied concerned the reliability of estimates when dif- 
ferent subsampling schemes are introduced. Subsampling characteristics based on income and 
demographic variables were identified and estimates of reliability for different subsampling 
rates and characteristics were calculated (U.S. Bureau of the Census 1985). 

This group concluded that subsampling proposals, for a general purpose income survey like 
the SIPP, provided only modest gains in precision for low-income items and did not outweigh 
the disadvantages, which included an increase in the complexity of the operation, the loss of 
a self-weighting design, and large decreases in precision for the middle income items. 


5. RESPONSE ERROR 


Response error is one aspect of a more general problem, nonsampling error, discussed by 
Kalton, Kasprzyk and McMillen (1988). Response error occurs when incorrect data are recorded 
on the questionnaire. This can occur for a variety of reasons, such as a faulty questionnaire, 
memory errors, inappropriate respondents, etc. In this section we briefly describe a response 
error issue with the SIPP gross flow data and a record check study aimed at providing insight 
into a better understanding of response errors in general. 
SIPP Gross Flow Data 


Analysis of program data on a month-to-month basis in ISDP revealed a tendency for 
reported program turnover to occur between waves of interviewing more often than within 
the wave (Moore and Kasprzyk 1984). Analysis using the SIPP data (Burkhead and Coder 1985) 
covering month-to-month changes in receipt of income benefit amounts for a 12 month period 
focussed on changes occurring between the last month of one reference period and the first 
months of the succeeding reference period. The results using SIPP and ISDP data are similar, 
where an uneven pattern of change is observed and this pattern is clearly associated with the 
interviewing scheme. Gross changes are significantly higher between the last month of one 
reference period and the first month of the next. Hill (1987) used monthly data from the 1984 
and 1985 waves of the Panel Study of Income Dynamics (PSID) to investigate the extent and 
determinants of excessive change between waves relative to measured change within waves of 
a panel survey. He found that in spite of different question sequences, and recall periods, 
between wave transitions dominate the within wave transitions in the PSID just as they do in 
the SIPP. The main causes for the problem are not known, but questionnaire wording/design, 
respondent recall error, and the interaction between these two factors seem likely. 

Weidman (1986) did an empirical analysis to look for obvious relationships between respon- 
dent characteristics and changes in receipt status of a number of income types. He did not detect 
any relationship between gross change distributions, self/proxy status and nine demographic 
variables (age, race, sex, education, marital status, household size, tenure, relationship to 
reference person, and size of metropolitan area) for consecutive months, but did note that more 
transitions occur when some of the data are imputed. The absence of any notable relation- 
ships indicates a need for exploring other ways to understand this problem. 

Interest in gross flow estimates remains high. Hubble and Judkins (1986) developed a model 
to estimate biases in gross flows estimates resulting from response errors, the parameters of 
which are estimated using SIPP response error rates and the ratios of within-wave and between- 
wave gross flow estimates. Several strong assumptions, as well as a reinterview program which 
produces accurate reinterview data on gross flows within the period, are necessary. Weidman 
(1987) presents linear models that try to represent the relationships between observed and actual 
transitions. The models are admittedly oversimplified, using only survey-reported data, but 
nevertheless illustrate the need to obtain more information about the SIPP error structure in 
reporting receipt of benefits from government transfer programs. 


SIPP Record Check Study 


One way to study the SIPP error structure in reporting receipt of program benefits and 
amounts is to develop validation studies of items common to both survey records and 
administrative records. The SIPP program has initiated such a study to investigate response 
quality issues. 

The goal is the improved understanding of the quality of the SIPP data and, ultimately, 
the development of quantitative estimates of response and nonresponse errors in order to adjust 
the survey data or modify survey procedures to obtain better quality data. The research ques- 
tions addressed in this study include: (1) the quality of the respondent reports of receipt of 
program benefits for a variety of state and Federally administered transfer programs; (2) the 
quality of benefit dollar amount reporting for these programs; (3) demographic correlates of 
report quality; (4) extent of misclassification errors; (5) the effects of self-proxy respondent 
status on report quality; and (6) between wave recipiency turnover effects. Four state 
administered programs and six Federally administered programs are included in the study. 
Moore and Marquis (1987) provide very preliminary results, suggesting that reporting problems 
are different for the Aid to Families with Dependent Children (AFDC) and the Food Stamp 
Programs, the former having a net under-reporting as well as a time placement problem for 
reporting a transition in program status, while the latter has only a time placement problem. 


6. CONCLUSION 


As in all large-scale continuing survey programs, research is needed to improve understanding 
of the effects of survey methods on the data collected. A survey, like the SIPP, which is complex 
in its implementation requires a commitment to understanding the measurement process. The 
wide range of topics discussed above (collection, longitudinal concepts and estimation, and 
response error) illustrates where interest and emphasis were placed during the development 
program and the first few years of the SIPP program. 
ABSTRACT 


A personal computer program for variance estimation with large scale surveys is described. The program, 
called PC CARP, will compute estimates and estimated variances for totals, ratios, means, quantiles, 
and regression coefficients. 


KEY WORDS: Survey sampling; Variance estimation; Survey software. 


1. INTRODUCTION 


The analysis of survey data typically involves a large number of observations and relatively 
complex variance calculations. Recent developments in personal computers have made possible 
the use of such computers to process data from complex surveys. We describe a personal com- 
puter program for survey data analysis prepared at Iowa State University. 

The project to develop statistical software for variance estimation on the personal computer 
was a joint undertaking between Iowa State University and the International Statistical Programs 
Center of the U.S. Census Bureau. The objective of the Census Bureau was to provide developing 
countries with software that can be used locally to process survey data collected locally. The Iowa 
State University project on variance estimation was part of a larger Census Bureau undertaking 
that included the development of software for survey management, data editing and tabulation. 

Beginning in the early 1970’s, based on the work of Hidiroglou (1974) and Fuller (1975), a 
program was developed at Iowa State University for the computation of regression coefficients 
and the estimated covariance matrix of the coefficients for survey data. The program, called 
SUPER CARP, was later expanded to include total estimation, ratio estimation, subpopula- 
tion statistics, two-way tables and two stage samples. The last revision of SUPER CARP took 
place in 1980. SUPER CARP furnished the starting point for software development on the per- 
sonal computer. Because of its ancestry, the personal computer program was called PC CARP. 


2. PROGRAM CAPABILITY 


PC CARP was designed for the IBM PC, IBM PC/XT, IBM PC/AT and compatible 
machines. At least 410K bytes of memory and a math coprocessor are required. 

PC CARP is capable of handling both large and small data sets with equal ease and efficiency. 
The program sets no limit on the number of strata or clusters that can appear in a data set and 
can accept up to 50 input variables at a time. The program accepts disk data files in either fixed 
or free format. 

The program can be used to compute variances for one or two stage samples with finite 
population correction terms included. For samples with more than two stages, finite population 
corrections are only available at two levels. For two-stage samples, the program computes within 
cluster sampling rates from the stratum sampling rates and the individual observation weights. 

Typically, each observation in the data file will contain stratum identification, cluster (pri- 
mary sampling unit) identification, and a weight where the weight is the inverse of the selection 
probability. The user may or may not elect to enter first stage sampling rates. For simple designs, 
such as simple random sampling, not all of this information is required. In such cases reduced 
data input is possible. 

If stratification is present, the program requires that all observations belonging to the same 
stratum be grouped together. If clustering is present, all observations belonging to the same cluster 
must be grouped together. 

Table 1 contains a description of the types of statistics available to the user of PC CARP. In 
addition to the items of Table 1, supplements are available for estimation of the logistic function 
and for post stratified samples. These supplements are discussed in Section 4.4 and Section 4.5. 
An “X” in the column headed “Cov. matrix” means that the covariance matrix of a vector of 
estimates of the type listed on the left can be obtained. The standard error is computed for all 
statistics, but the covariance matrix of a vector is available for only a restricted set. Also, the coef- 
ficient of variation is computed for many statistics. The design effect, denoted by DEFF, is available 
as an option for many of the statistics. See Kish (1965) for a description of the design effect. 


Table 1 
Analysis Capabilities of PC CARP 


Analysis                 Coef. of var. / Cov. matrix / Design effect    Comments

Population Analyses
Total Estimation         X  X  X    50 variables maximum
Ratio Estimation         X  X  X    50 variables maximum without covariances, 15 with covariances
Difference of Ratios     X          15 variables maximum
Stratum Analyses
Totals                   X          50 variables maximum
Means                    X  X       50 variables maximum
Proportions              X  X       50 variables maximum
Subpopulation Analyses
Totals                   X  X       Crossed classif.
Means                    X  X       Multiple variables
Proportions              X  X       Crossed classif., multiple variables
Ratios                   X  X       Crossed classif., multiple variables
Other Analyses
Two-way Table            X          50 cells maximum, proportionality test
Regression               X          50 variables maximum, multiple d.f. tests, Y-hat, residuals
Univariate               X          Multiple variables, empirical CDF, quantiles
The population (Total, Ratio and Difference of Ratios) analyses and stratum analyses are 
performed in a straightforward manner. Some details pertaining to Subpopulation Analyses, 
the Two-Way Table, Regression Analysis, and Univariate Analysis are presented in Section 4. 

The subpopulation analyses give the user the option of crossing classification variables. This 
allows the user to create new classification structures from two or more input classification 
variables. For example, suppose the input data includes the classification variables age, sex and 
education with six, two and five levels, respectively. Then, by crossing age with sex with educa- 
tion, a new classification structure with 60 levels is produced. The user may obtain estimates 
for any number of dependent variables under this classification structure. 
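
The crossing described above can be illustrated with a short Python sketch. It is not taken from PC CARP; the function names `cross_levels` and `crossed_code` and the 1-based level coding are assumptions made for illustration only.

```python
from itertools import product


def cross_levels(*level_counts):
    """Enumerate the cells obtained by crossing classification variables.

    Each argument is the number of levels of one input variable; levels
    are coded 1..k here purely for illustration.
    """
    return list(product(*(range(1, k + 1) for k in level_counts)))


def crossed_code(levels, level_counts):
    """Map one observation's levels (1-based) to a single cell index (1-based)."""
    code = 0
    for level, k in zip(levels, level_counts):
        if not 1 <= level <= k:
            raise ValueError("level out of range")
        code = code * k + (level - 1)
    return code + 1


# Age (6 levels) crossed with sex (2 levels) and education (5 levels).
cells = cross_levels(6, 2, 5)
```

With six, two and five input levels the crossed structure has 6 x 2 x 5 = 60 cells, which agrees with the example in the text.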

The Two-way Table analysis is defined by two classification variables and a dependent variable. 
More than one dependent variable can be specified for a pair of classification variables. Tables 
of cell totals, of proportions based on row totals, of proportions based on column totals, and 
of proportions based on the grand total are computed for each dependent variable. Standard 
errors are computed for all estimators and a test statistic for the hypothesis of proportionality 
is output. The test statistic is based on a Satterthwaite approximation to the distribution of the 
Pearson chi-square statistic. Also see Rao and Scott (1984). 

The weighted least squares regression analysis computes coefficient estimates, and an estimated 
variance-covariance matrix which takes into account the sample design. These calculations are 
given in Fuller (1975) and outlined in Hidiroglou et al. (1980). Multiple degrees of freedom 
F-tests for sets of coefficients and the usual t-statistics are available. The user also has the option 
of obtaining residuals and predicted values. 

The Univariate analysis provides statistics that describe the distribution of a variable. The 
user specifies the variable of interest and identifies a subpopulation by specifying a category of 
a classification variable. Thus, the user might elect to obtain statistics for the personal income 
of individuals in the professional category of the occupation classification. Estimates of the sub- 
population mean, variance, distribution function, quantiles and interquartile range are produced. 


3. PROGRAM DETAILS 


PC CARP is written almost entirely in FORTRAN, the most widely known scientific pro- 
gramming language, and the IBM Professional FORTRAN compiler was selected for the pro- 
ject. A small portion of the code — some sections of the user interface — is written in IBM 
Assembly language. 

Two concerns at the program development stage were to provide a friendly user interface and 
to minimize the number of passes through the data. The interface was made user friendly by 
implementing an interactive, screen oriented response system. A single pass algorithm for variance 
estimation of simple statistics minimized the amount of reading from data files. Most estimators 
and their variances are obtained in a single pass through the data. 

Estimators can be computed for the total population, for each stratum, or for specified sub- 
populations. For the most part, the estimators are functions of weighted sample totals. For 
example, to compute the estimators of the ratios $R_1 = Y_1/X_1$ and $R_2 = Y_2/X_2$, one accumulates 
totals for $Y_1$, $X_1$, $Y_2$, and $X_2$. If the estimate is for the entire population, these totals are 
accumulated in one pass through the data. Totals for stratum estimates can be accumulated, com- 
bined if necessary, and output stratum by stratum. Since the data are grouped by strata, stratum 
totals can also be obtained in one pass for any number of strata. Subpopulation estimators may 
require more than one pass through the data if the number of categories defined by the classification 
structure is large. The Regression and Univariate analyses require two passes through the data. 
The estimators, with the exception of totals, are nonlinear functions of weighted sample 
moments. It follows that a method appropriate for a nonlinear function must be used to estimate 
the variance of the approximate distribution of such estimators. See Wolter (1985) for a discus- 
sion of variance estimation for complex surveys. The Taylor method (method of statistical dif- 
ferentials) is the method of variance estimation used in PC CARP. Generally, the Taylor method 
has been shown to be equal to or superior to other variance estimation methods for the statistics, 
such as ratios, under consideration. See, for example, Frankel (1971). The Taylor variance of 
the ratio estimator is given in such standard texts as that of Cochran (1977) and the Taylor 
variance of a regression coefficient is given by Fuller (1975). 

The value of the estimator and its estimated variance can, in most cases, be computed in the same 
pass. This is because the first order Taylor approximation to the variance can be expressed in terms 
of the variances of totals. For example, the first order Taylor approximation to $\hat{R} = \hat{Y}/\hat{X}$ is 


$$\hat{R} \doteq R + X^{-1}(\hat{Y} - R\hat{X}),$$


where $R = Y/X$ is the ratio of the true totals. It follows that the estimated variance of a ratio 
$\hat{R} = \hat{Y}/\hat{X}$ can be computed from the estimated variances of the totals of $Y$, $X$, and $(Y - X)$. 
Similarly, the estimated covariance matrix for $\hat{R}_1 = \hat{Y}_1/\hat{X}_1$ and $\hat{R}_2 = \hat{Y}_2/\hat{X}_2$ can be computed 
from the estimated variances of the totals of the ten quantities $Y_1$, $X_1$, $(Y_1 - X_1)$, $Y_2$, $X_2$, 
$(Y_2 - X_2)$, $(Y_1 - Y_2)$, $(Y_1 - X_2)$, $(Y_2 - X_1)$, and $(X_1 - X_2)$. 
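
The reason variances of totals suffice can be sketched in a few lines of Python. The covariance of two totals is recovered from the identity Cov(Y, X) = [Var(Y) + Var(X) - Var(Y - X)]/2, and the linearized variance of the ratio is then Var(Y - RX)/X^2. This is an illustration of the algebra only, not the PC CARP code; the function name `ratio_variance` and its argument names are assumptions.

```python
def ratio_variance(y_tot, x_tot, var_y, var_x, var_y_minus_x):
    """First order Taylor variance of the ratio y_tot / x_tot.

    All three variance arguments are estimated variances of totals:
    of Y, of X, and of the difference (Y - X).
    """
    r = y_tot / x_tot
    # Covariance recovered from the three variances of totals.
    cov_yx = 0.5 * (var_y + var_x - var_y_minus_x)
    # Var(Y - R X) = Var(Y) + R^2 Var(X) - 2 R Cov(Y, X), scaled by X^-2.
    var_r = (var_y + r * r * var_x - 2.0 * r * cov_yx) / (x_tot * x_tot)
    return r, var_r
```

Because only variances of totals enter, each of them can be accumulated in the same pass that accumulates the totals themselves.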

The algorithm used for the calculation of the weighted mean and weighted sums of squares 
and cross products matrices is described in Herraman (1968). For sample values $X_i$ and 
corresponding weights $W_i$, the sequence of weighted means, $\bar{X}_K$, and weighted corrected sums of 
squares, $S_K$, is computed as 


$$\bar{X}_K = \bar{X}_{K-1} + a_K d_K \quad\text{and}\quad S_K = S_{K-1} + D_K - D_K a_K,$$


where $d_K = X_K - \bar{X}_{K-1}$, $a_K = W_K\left(\sum_{i=1}^{K} W_i\right)^{-1}$, and $D_K = d_K^2 W_K$. 
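
A minimal Python sketch of this one-pass recursion for a single variable follows. It is written from the formulas above rather than from the FORTRAN source, and it assumes starting values of zero for the mean and the sum of squares and strictly positive weights; the function name is hypothetical.

```python
def weighted_mean_and_ss(values, weights):
    """One-pass weighted mean and weighted corrected sum of squares."""
    mean = 0.0   # running weighted mean
    ss = 0.0     # running weighted corrected sum of squares
    wsum = 0.0   # running sum of weights
    for x, w in zip(values, weights):
        if w <= 0:
            raise ValueError("weights are assumed to be positive")
        wsum += w
        d = x - mean          # d_K = X_K - mean_{K-1}
        a = w / wsum          # a_K = W_K / (W_1 + ... + W_K)
        big_d = d * d * w     # D_K = d_K^2 W_K
        mean += a * d         # mean_K = mean_{K-1} + a_K d_K
        ss += big_d - big_d * a   # S_K = S_{K-1} + D_K - D_K a_K
    return mean, ss
```

For the values 1, 2, 4 with weights 1, 2, 1 the recursion returns a mean of 2.25 and a corrected sum of squares of 4.75, the same as the direct two-pass calculation.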

Up to three different variance quantities can be accumulated concurrently for any given 
estimator. These are the first stage variance component, the optional second stage variance com- 
ponent and the optional simple random sampling variance used in the computation of the design 
effect. Computing all variance quantities in a single pass through the data requires a large amount 
of array space. However, when working with large samples, the elimination of entire passes 
through the data out-weighs the use of additional memory. 

The program routinely performs checks to avoid computational errors such as division by 
zero. For example, if the user enters a data set with only one cluster in a stratum, the program 
will assign zero variance to the stratum, complete the calculations, and print an error message 
identifying the stratum with a single cluster. 

The error handling system was constructed to avoid program termination caused by user 
misspecifications that could be easily corrected. Checks for omitted responses, improper file 
names and invalid analysis variable specifications are included in the program. If such an error 
is detected, PC CARP permits the user to re-enter information or to exit the program. 

Program accuracy was assessed by constructing examples and comparing results with those 
obtained using the mainframe program SUPER CARP. The data set of Longley (1967) was used 
to evaluate the accuracy of the regression program. Additional checks were made using PROC 
MATRIX of the SAS package. See Barr, et al. (1979). PC CARP numerical accuracy was found 
to be at the same level as the mainframe packages. Internal consistency of PC CARP was also 
verified by computing equivalent estimators using different options, e.g., by computing a sub- 
population mean using the subpopulation option and using the ratio option. 
When information is needed by PC CARP, the user receives a full screen of short response 
questions along with detailed instructions. The first set of screens displayed to the user ask for 
information pertaining to data organization and location. “Help” and “Go Back” options are 
available at many places. 

The second phase of program execution is Analysis Specification. In this phase the user chooses 
the type of analysis, options for that type of analysis, and the analysis variables. Any number 
of analyses can be performed using the data specified in phase one. 


4. SPECIAL FEATURES 


4.1 Two Way Table 


As described in Section 2, this option automatically provides the user with four tables, where 
the entries are determined by the type of marginal control exercised in constructing the table. 
We outline the procedure used to construct the table of cell proportions and the estimated 
covariance matrix of the proportions. Suppose the table has $R$ rows and $C$ columns and let $\hat{Y}_{rc}$ 
be the estimated total for the $rc$-th cell. Let $\hat{Y}$ be the $RC$-dimensional column vector of cell totals, 
created by listing the columns of totals one beneath the other beginning with the first column. Let 


$$\hat{Y}_{..} = \sum_{r=1}^{R}\sum_{c=1}^{C}\hat{Y}_{rc} \quad\text{and}\quad \hat{P}_{rc} = \hat{Y}_{..}^{-1}\hat{Y}_{rc}$$


be the estimated population total and the estimated cell proportion for cell $rc$, respectively. 
Let $\hat{P}$ be the $RC$-dimensional column vector, analogous to $\hat{Y}$, composed of the $RC$ values 
$\hat{P}_{rc}$, arranged by column. The estimated covariance matrix for $\hat{P}$ is 


$$\hat{V}_{PP} = \hat{Y}_{..}^{-2}\,[I_{RC} - (\hat{P} \otimes J_{RC}')]\,\hat{V}_{YY}\,[I_{RC} - (\hat{P} \otimes J_{RC}')]',$$


where $\hat{V}_{YY}$ is the estimated covariance matrix of the vector of cell totals $\hat{Y}$, $I_{RC}$ is the identity 
matrix of dimension $RC$, and $J_{RC}$ is an $RC$-dimensional column vector of ones. 

The matrix $\hat{V}_{PP}$ is used to compute the test statistic for the hypothesis of proportionality. The 
null hypothesis for the test is the hypothesis that the interior entries in the population table are 
the products of the marginal proportions. See Rao and Scott (1984) for a discussion of tests for 
such hypotheses. The test in PC CARP is based on a Satterthwaite approximation to the distri- 
bution of the Pearson chi-square statistic constructed as if the proportions were multinomial 
proportions. The approximation is valid for any analysis variable. 
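
The covariance matrix of the cell proportions described earlier in this section can be sketched in plain Python as follows. The sketch assumes the cell totals have already been stacked by column into one list and that the covariance matrix of the totals is supplied as a nested list; the function name and the argument layout are illustrative assumptions, not the PC CARP interface.

```python
def proportion_covariance(cell_totals, v_yy):
    """Covariance matrix of cell proportions from that of cell totals.

    cell_totals: the RC estimated cell totals, stacked by column.
    v_yy:        RC x RC estimated covariance matrix of the totals.
    """
    n = len(cell_totals)
    grand = sum(cell_totals)
    p = [y / grand for y in cell_totals]
    # A = I - P J', where J is a column vector of ones.
    a = [[(1.0 if i == j else 0.0) - p[i] for j in range(n)] for i in range(n)]
    # A V_YY
    av = [[sum(a[i][k] * v_yy[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    # V_PP = (grand total)^-2 * A V_YY A'
    v_pp = [[sum(av[i][k] * a[j][k] for k in range(n)) / grand ** 2
             for j in range(n)] for i in range(n)]
    return p, v_pp
```

Because the proportions sum to one, every row and column of the resulting matrix sums to zero, which gives a convenient numerical check.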


4.2 Quantile Estimation 


Among the statistics produced by the univariate option are estimates of quantiles and an 
estimator of the standard error of the quantiles. The first step in the computation of quantiles 
is the construction of an estimator of the cumulative distribution function. In a first pass through 
the data the range of observations, the sample mean, and the sample standard deviation are con- 
structed. Also, the three largest observations and the three smallest observations are identified. 
The estimated cumulative distribution function is defined by 


$$\hat{F}_Y(x) = \left(\sum_{t=1}^{m} w_t Z_{St}\right)^{-1}\sum_{t=1}^{m} w_t Z_{St} I_{Y_t}(x),$$


where the summation is over the $m$ elements in the sample, $w_t$ is the sample weight, $Z_{St}$ is an 
indicator function that is one if the observation is in the subpopulation of interest and zero 
otherwise, and $I_{Y_t}(x)$ is one if $Y_t \le x$ and is zero otherwise. The range of the variable is divided into 
100 intervals and the cumulative distribution function is estimated at the 101 values defined by 
this subdivision. 
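
The weighted distribution function estimator can be sketched as below. The sketch assumes that the grid spans the observed range of the subpopulation values and that the indicator counts values less than or equal to the grid point; both are assumptions, as is the function name.

```python
def weighted_cdf(y, w, in_subpop, n_intervals=100):
    """Weighted empirical CDF of y within a subpopulation on an equal grid.

    in_subpop holds the 0/1 subpopulation indicators. The grid has
    n_intervals + 1 points covering the observed subpopulation range.
    """
    sub = [(yi, wi) for yi, wi, z in zip(y, w, in_subpop) if z]
    if not sub:
        raise ValueError("subpopulation is empty")
    lo = min(yi for yi, _ in sub)
    hi = max(yi for yi, _ in sub)
    total = sum(wi for _, wi in sub)
    step = (hi - lo) / n_intervals
    # The last grid point is set to hi exactly to avoid rounding drift.
    grid = [lo + k * step for k in range(n_intervals)] + [hi]
    cdf = [sum(wi for yi, wi in sub if yi <= x) / total for x in grid]
    return grid, cdf
```

This direct version rescans the data for every grid point, which keeps the sketch short; a production routine would accumulate the weights into the intervals in a single pass.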

The covariance matrix for the estimated distribution function evaluated at 25 points, $i = 1, 
5, \ldots, 96$, is estimated. The estimated standard errors are smoothed with a three point moving 
average and interpolation is used to obtain an estimated standard error for each of the 101 points 
of the estimated distribution function. Linear interpolation is used to create an estimated 
distribution function that is monotone increasing. Using the smoothed standard errors, a monotone 
increasing upper bound and monotone increasing lower bound that form a pointwise 95% 
confidence interval for the distribution function are established. These bounds are then inverted 
to form 95% confidence intervals for the quantiles. The interquartile range and its standard error 
are also estimated. 

The quantile estimation is based on a theory that assumes the existence of an underlying super- 
population distribution function with a positive density. See Francisco (1987) for theoretical 
details and Park (1987) for computational aspects. 


4.3 Regression Estimation 


Estimates of the coefficients of a linear regression model are computed by the method of 
weighted least squares. Using the procedure given in Fuller (1975), an estimator of the covariance 
matrix of the coefficient vector is computed, taking into account the sample design. 
The coefficient vector is estimated by 


$$\hat{b} = (X'WX)^{-1}X'WY,$$


where $X$ is the $n \times p$ matrix of independent variable values, $Y$ is the $n$-dimensional vector of 
dependent variable values, $W$ is a matrix with the observation weights on the diagonal and zeros 
elsewhere, and $n$ is the total number of observations. The variance of $\hat{b}$ is estimated by 


$$\hat{V}(\hat{b}) = (X'WX)^{-1}\hat{G}(X'WX)^{-1}.$$


The matrix $\hat{G}$ is 


: 
where 
mi; 

dj = X ij Viik Wik, 
ke 


n= (n; — asry, C—(n —.p) .(n—1), m,; is the number of elements in cluster j of stratum /, 
$n_i$ is the number of clusters in stratum $i$, $n$ is the total number of observations, $L$ is the number of 
strata, and $p$ is the number of coefficients estimated. The variance estimator differs from the usual 
weighted least squares variance estimator in that the matrix $\hat{G}$ is used in place of $(X'WX)s^2$. 
A multiple R-squared statistic is computed for models with an intercept. An F-test for the 
overall regression is always computed and an option for testing subsets of coefficients is provided. 
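
The structure of this variance estimator can be illustrated with NumPy. The sketch below assumes that the middle matrix has the usual stratified between-cluster form, built from cluster totals of the weighted residual scores with multiplier (n - 1)/(n - p) and no finite population correction. That form is an assumption made for illustration and may differ in detail from the expression used in PC CARP; the function name is hypothetical.

```python
import numpy as np


def design_based_wls(X, y, w, stratum, cluster):
    """Weighted least squares with a design-based covariance estimate.

    Assumed middle matrix:
        G = c * sum_i n_i/(n_i - 1) * sum_j (d_ij - dbar_i)(d_ij - dbar_i)'
    with d_ij the cluster total of x * residual * weight, c = (n-1)/(n-p).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    stratum = np.asarray(stratum)
    cluster = np.asarray(cluster)
    n, p = X.shape
    xtwx_inv = np.linalg.inv(X.T @ (w[:, None] * X))
    b = xtwx_inv @ (X.T @ (w * y))
    resid = y - X @ b
    scores = X * (w * resid)[:, None]      # one score row per observation
    G = np.zeros((p, p))
    for s in np.unique(stratum):
        in_s = stratum == s
        ids = np.unique(cluster[in_s])
        n_i = len(ids)
        if n_i < 2:
            continue                       # one-cluster stratum adds nothing
        d = np.array([scores[in_s & (cluster == c)].sum(axis=0) for c in ids])
        dev = d - d.mean(axis=0)
        G += n_i / (n_i - 1) * (dev.T @ dev)
    G *= (n - 1) / (n - p)
    return b, xtwx_inv @ G @ xtwx_inv
```

When the model fits exactly the residuals vanish and the estimated covariance matrix is zero, which is a quick sanity check on any implementation of this form.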


4.4 Logistic Regression 


Estimates of the multivariate logistic model are obtained with this option. The algorithms for 
logistic regression were developed after the initial version of PC CARP was completed. Because 
the mean function for the logistic model is nonlinear in the parameters, the estimates are computed 
using an iterative weighted least squares algorithm. The variances of the estimates are computed 
by the extension to nonlinear estimation of the procedures given in Fuller (1975). See also Binder 
(1983). The basic operation of the Logistic Regression option is the same as that of the Regression 
option. For example, independent and dependent variables are specified in the same way. 


4.5 Post Stratification 


After completion of the original PC CARP program a supplement for post stratification was 
developed for many of the estimators. The post stratification is assumed to be that in which the 
weights have been adjusted to produce estimates for certain categories that match known popula- 
tion totals. This type of post stratification is called gamma post stratification by Fuller and Sullivan 
(1987). 

The program computes the variance of the post stratification estimator based on a represen- 
tation in which the estimator is expressed as a sum of ratio estimators. 


4.6 Stratum Collapse 


For purposes of variance computation, the user may use the collapse option to eliminate 
one-cluster strata. If this option is chosen, every one-cluster stratum is grouped with the immediately 
following stratum in the data set. The stratum and cluster identification of the involved records 
are changed to reflect the new stratification. If stratum sampling rates are present, new rates are 
defined by 


x oa sr | 
Le pce et et ie May) 


where stratum $i$, with $n_i = 1$, has been combined with stratum $i + 1$. These new rates are also 
saved in an auxiliary rate file for possible future use. Different orderings of the strata will produce 
different collapsed data sets and different collapsed stratum rates. The program requires an 
additional pass through the data when either the collapse or the two-stage option is selected. 
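
The relabelling step of the collapse option can be sketched as follows. The sketch assumes records arrive grouped by stratum and by cluster within stratum, and it leaves a one-cluster stratum unchanged when it is the last stratum in the file, since that case is not described above. The function name is hypothetical and the computation of new sampling rates is not attempted.

```python
def collapse_one_cluster_strata(records):
    """Relabel (stratum, cluster) pairs after collapsing one-cluster strata.

    records: list of (stratum, cluster) pairs in file order.
    Returns the list of new (stratum, cluster) pairs, numbered from 1.
    """
    # Ordered strata, each with its ordered list of cluster identifiers.
    strata = []
    for s, c in records:
        if not strata or strata[-1][0] != s:
            strata.append((s, []))
        if not strata[-1][1] or strata[-1][1][-1] != c:
            strata[-1][1].append(c)
    mapping = {}
    new_id = 0
    i = 0
    while i < len(strata):
        new_id += 1
        s, clusters = strata[i]
        members = [(s, c) for c in clusters]
        # A one-cluster stratum absorbs the immediately following stratum.
        if len(clusters) == 1 and i + 1 < len(strata):
            s2, clusters2 = strata[i + 1]
            members += [(s2, c) for c in clusters2]
            i += 1
        for new_cluster, key in enumerate(members, start=1):
            mapping[key] = (new_id, new_cluster)
        i += 1
    return [mapping[(s, c)] for s, c in records]
```

Since each one-cluster stratum is merged with whatever follows it, reordering the strata changes the result, as noted in the text.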


4.7 Hot Deck Imputation 


PC CARP requires a complete data set for analysis. Many practitioners will write a special 
program, or use one of the readily available PC programs to edit their data and to impute for 
missing values. 

For those desiring it, a hot deck imputation program, called PRE CARP, is provided with 
PC CARP. The hot deck operation replaces a missing value with the value for the same item 
from the record immediately preceding the missing record in the data file. PRE CARP permits 
the user to specify a classification variable, containing up to ten categories, such that the missing 
value is replaced by the value from the preceding record in the same category. PRE CARP will also create an 
indicator variable for each variable with missing values. This indicator variable can then be used 
with the subpopulation option to compute means based on the original observations. 
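
A sequential hot deck of this kind can be sketched in a few lines of Python. The sketch is not the PRE CARP code: it represents missing values by `None`, leaves a value missing when no earlier donor exists in the same category, and does not enforce the ten-category limit. All of these choices, and the function name, are assumptions.

```python
def hot_deck(values, classes=None):
    """Replace missing values (None) by the most recent preceding value.

    If classes is given, the donor is the most recent preceding record
    in the same category. Returns the imputed list and a 0/1 indicator
    list marking which values were originally missing.
    """
    last_value = {}
    imputed, was_missing = [], []
    for idx, value in enumerate(values):
        key = classes[idx] if classes is not None else None
        if value is None:
            was_missing.append(1)
            imputed.append(last_value.get(key))  # stays None without a donor
        else:
            was_missing.append(0)
            imputed.append(value)
            last_value[key] = value
    return imputed, was_missing
```

The indicator list plays the role of the indicator variable described above: restricting an analysis to records with indicator zero gives statistics based on the original observations only.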


5. EXAMPLES 


In this section, several analyses are performed with a constructed data set and run times are 
presented. The purpose of the test runs is not to examine all possible combinations of factors 
influencing processing time, but rather to give an idea of the time required to run some of the 
available program analyses. 

The test data were constructed from a subset of the second National Health and Nutrition 
Examination Survey (NHANES II). The test data set has 2400 observations which are divided 
into 32 strata. Each stratum has two primary sampling units and the primary sampling units 
are of varying sizes. Each observation also has a non-zero sampling weight. 


Figure 1. Output for Example C, Mean Age by Sex and Race Combinations 


Subpopulation Means 
Dependent variable is Age 


Category                        Estimate       S.E.          C.V.         DEFF 

Sex = 1.0000   Race = 1.0000    5.06811D+01    6.19678D-01   8.0197D-02   1.3967D+00 
Sex = 1.0000   Race = 2.0000    5.13016D+01    7.88580D-01   &.5193D-02   9.5384D-01 
Sex = 1.0000   Race = 3.0000    3.41579D+01    2.24111D+00   6.5610D-02   2.0965D+00 
Sex = 2.0000   Race = 1.0000    1.337428D+01   3.18904D-01   &.5845D-02   1.2588D+00 
Sex = 2.0000   Race = 2.0000    ***** 
Sex = 2.0000   Race = 3.0000    1.71957D+01    9.53816D-01   5.5468D-02   1.1389D+00 
Figure 2. Univariate Output for Nonfarm Households for Example D 
UNIVARIATE 1 
Classification variable is Farm and its level is 1 


Number of Sample Elements in Subpopulation = 111 
Dependent variable is Age 

Subpopulation Variance = 4.25645D+02 
Subpopulation C.V. = 7.14101D-01 


Subpopulation Mean 


Estimate         S.E.            C.V.          DEFF 
2.8891089D+01    2.2543875D+00   7.80305D-02   9.86336D-01 
Extreme Values of Sample Elements in Subpopulation 
Smallest Number of First Observation ID 
Values Observed Values Stratum Cluster Weight 
1.000D + 00 1 32 1 }.000D + OO 
2.000D + OO 1 15 i: 2.000D + OO 
5.000D + OO 3 10 2 3.000D + OO 
Largest Number of First Observation ID 
Values Observed Values Stratum Cluster Weight 
7.400D +O1 2 29 ii }6.000D + OO 
7.100D+0O1 2 a Hl 2.000D + OO 
7.000D +01 4 ie it 2.000D + 00 
Quantiles 
        Estimate         S.E.             95% Confidence Interval 
0.01    2.2690811D+00    7.5167585D-01    (8.05729D-01, 3.73243D+00) 
0.05    4.2364814D+00    1.1977759D+00    (3.34942D+00, 8.14053D+00) 
0.10    7.7750203D+00    1.35635759D+00   (5.36691D+00, 1.07924D+01) 
0.25    1.3652930D+01    1.4225576D+00    (9.99238D+00, 1.56826D+01) 
0.50    1.9449315D+01    2.2740912D+00    (1.571928D+01, 2.48156D+01) 
0.75    4.5698071D+01    4.7577709D+00    (3.58936D+01, 5.49247D+01) 
0.90    6.2787426D+01    2.5472775D+00    (5.51417D+01, 6.45308D+01) 
0.95    6.5923423D+01    1.22283544D+00   (6.43837D+01, 6.92750D+01) 
0.99    7.1714993D+01    1.1425033D+00    (7.05401D+01, 7.40000D+01) 
Interquartile Range 
        Estimate         S.E. 
        3.2045141D+01    4.2434890D+00 


The variables in the data set are: 


1. Sex       1 = male, 2 = female 

2. Race      1 = white, 2 = black, 3 = other 

3. Farm      1 = non-farm household, 2 = farm household 

4. Income    Household income in thousands of dollars 

5. Age       Age in years. 


A variable whose value is one for every observation (intercept variable) was created by the pro- 
gram. The analyses performed were: 

A. Mean income for the sampled population 

B. Mean income by stratum 

C. Mean age for the two way classification of sex and race 

D. Sample distribution functions of age for farm and non-farm groups. 

Analysis A, estimating mean income, was performed using the Ratio option with Income as 
the numerator variable and the intercept variable in the denominator. The estimates of mean 
income by stratum, Analysis B, were computed directly with the Stratum Means option. Analysis 
C was performed with the Subpopulation Means option by crossing the classification variables 
Sex and Race and specifying Age as the dependent variable. The output from this analysis is 
given in Figure 1. The symbols “*****” under the classification “Sex = 2, Race = 2” indicate 
that there were no observations falling into that classification category. The values of the design 
effects underscore the importance of taking into account the sampling design in the computation 
of estimated variances. For example, the design effect for the estimate with Sex = 1 and 
Race = 3 is approximately two. This means that the estimated variance of the sample mean for 
a simple random sample is one half of the variance estimate for the stratified cluster sampling 
plan. Characteristics of the distribution of Age for each of the two levels of the variable “Farm” 
were estimated using the Univariate option. The portion of the output for this example that per- 
tains to nonfarm households is given in Figure 2. All the variances and standard error estimates 
given in this output take into account the sampling design. 
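
To make the design-effect interpretation concrete, the following minimal Python sketch computes the ratio described above. The function name and the two variance values are illustrative assumptions; they are not output from the program.

```python
def design_effect(var_design: float, var_srs: float) -> float:
    """Ratio of the design-based variance estimate to the variance
    that a simple random sample of the same size would give."""
    if var_srs <= 0:
        raise ValueError("var_srs must be positive")
    return var_design / var_srs


# Hypothetical variances, chosen so that the design effect equals two:
# the simple-random-sampling variance is half the design-based variance.
deff = design_effect(var_design=0.50, var_srs=0.25)
print(deff)
```

A design effect above one indicates that ignoring stratification and clustering would understate the variance of the estimate.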

The run times (in seconds) for analyses A, B, C and D for the 2,400 observations were 70, 
135, 120 and 360, respectively. The runs were made on an IBM PC AT with the data stored on 
the hard disk and read in free format. Stratum sampling rates were not entered into the pro- 
gram. Output was routed to the monitor and to a disk file. Design effects for the estimates were 
requested in all of the analyses. The first three analyses require only one pass through the data
for each analysis. More statistics are computed for analyses B and C than for analysis A. Analysis 
D requires 4 passes through the data, two passes for each univariate analysis. 
The 1986 Test of Adjustment Related Operations
in Central Los Angeles County 


GREGG DIFFENDAL¹


ABSTRACT 


As part of the planning for the 1990 Decennial Census, the Census Bureau investigated the feasibility 
of adjusting the census for the estimated undercount. A test census was conducted in Central Los Angeles 
County, in a mostly Hispanic area, in order to test the timing and operational aspects of adjusting the 
Census using a post-enumeration survey (PES). This paper presents the methodology and the results in 
producing a census that is adjusted for the population missed by the enumeration. The methodology used 
to adjust the test census included the sample design, dual-system estimation and small area estimation. 
The sample design used a block sample with blocks stratified by race/ethnicity. Matching was done by 
the computer with clerical review and resolution. The dual-system estimator, also called the Petersen 
estimator or capture-recapture, was used to estimate the population. Because of the nature of the census
enumeration, corrections were made to the census counts before using them in the dual-system estimator. 
Before adjusting the small areas, a regression model was fit to the adjustment factor (the dual-system 
estimate divided by the census count) to reduce the effects of sampling variability. A synthetic estimator 
was used to carry the adjustment down to the block level. The results of the dual-system estimates are 
presented for the test site by the three major race/ethnic groups (Hispanic, Asian, Other) by tenure, by 
age and by sex. Summaries of the small area adjustments of the census enumeration, by block, are 
presented and discussed. 


KEY WORDS: Census undercount; Dual-system estimation; Synthetic estimation; Post-enumeration 
survey. 


1. INTRODUCTION 


Since the first U.S. Census in 1790, problems have existed in finding and counting every 
person who should be counted. Advances in demographics and statistics have permitted census 
coverage estimates to be produced, beginning with the 1950 census. Coverage estimates have 
been used to evaluate census shortfalls and determine areas of needed improvements for 
succeeding censuses. The census coverage estimates have shown a steady improvement in census 
taking since the 1950 estimates were produced. One series of estimates shows the U.S. level 
undercount was 4.4% for 1950, 3.3% for 1960, 2.8% for 1970 and 1% for 1980. Despite this 
continuing reduction in the percent undercount, estimates remain higher for certain groups 
in the U.S. For example, the black undercount has remained about 5 percent above the national 
average. 

Results also indicate high undercounts for other ethnic groups, especially the
Hispanic population. Central cities have higher undercounts as do rural areas. Males have higher 
undercounts than females. The age group 20 to 45 also has a high undercount. 

The methods used since 1950 to measure the undercount in the U.S. are a post-enumeration 
survey (PES) and demographic analysis. The Census Bureau has announced that these will be 
the major tools to estimate the undercount for the 1990 census. A PES uses an independent 
sample of persons that are matched to the census to estimate the total population. Marks (1978) 
and U.S. Census Bureau (1979) describe previous work on using a PES to measure census coverage.
Demographic analysis uses birth, death and other administrative records to estimate
the total population in the U.S. Fay et al. (1988) describe the 1980 undercount estimates from
demographic analysis and the Post-Enumeration Program (PEP). 

In 1980, increased scrutiny of census numbers resulted in a number of court suits arguing 
for an adjustment of the 1980 census counts. Some of the issues that led to the court suits 
include: the existence of the differential undercount between blacks and nonblacks; the 
introduction of revenue sharing in the 1970’s which tied monies directly to population counts; 
and declining populations in some cities and states which have traditionally had higher under- 
counts. The U.S. Census Bureau argued against adjustment of the 1980 census for the measured 
undercount on the basis that the measurement was error prone and an adjustment would not 
improve the unadjusted census counts. 

The Census Bureau did embark on a research program after 1980 to evaluate alternate 
methods and ways to improve the undercount measurement process (Mulry et al. 1981 and
Hogan 1984). Hogan (1984) proposed a series of tests to improve the undercount measurements 
in conjunction with the test censuses. These started with a PES in Tampa, Florida in 1985 to 
test and evaluate computer matching. This test verified the feasibility of computer matching 
(Jaro and Childers 1986). Test censuses and PES’s were also conducted in 1986 in Los Angeles 
and Mississippi. A PES was conducted in Los Angeles to test the timing and operational aspects 
of adjusting the census. In Mississippi, a PES was conducted to evaluate the PES operations 
in a rural test site (Anolik 1988). 

A Pre-Enumeration Survey was also conducted in 1986 in Los Angeles to determine if further 
gains in timing could be obtained if some of the field work was conducted before the census 
rather than after the census enumeration as in a PES (Wolfgang 1987). A PES was conducted
in 1987 in rural North Dakota for evaluation of the PES operations in rural areas, where a 
door-to-door enumeration is conducted rather than a mail-out census as in the other test sites. 
Finally, work is under way for the 1988 Census Dress Rehearsal. The Dress Rehearsal will be 
used to test all census operations before conducting the 1990 Decennial Census. 

The focus of this paper is on the 1986 PES in Central Los Angeles County, called the Test 
of Adjustment Related Operations (TARO), conducted in conjunction with the test census. 
The test site comprised three major race/ethnic groups: Hispanic, about 75% of the total 
population; Asian, about 15%, and Other, mostly white, with about 10% of the total popula- 
tion. The results of the PES show an estimated undercount of 9%. For the major race/ethnic 
groups in the test site, the Hispanic, Asian, and Other undercounts are estimated at 9.8%, 7.3%, 
and 6.2%, respectively. This paper describes the methodology and operational aspects of 
estimating these undercounts. 

Section 2 presents the methodology used in 1986 to measure the undercount and how to 
incorporate the undercount estimates into the census count to produce an adjusted census. 
Section 3 discusses the schedule of operations in carrying out TARO including field operations 
and matching. Section 4 presents a summary of the undercount estimates for the poststrata 
and undercount estimates at the block level. Section 5 summarizes the major findings and 
presents some conclusions. 


2. METHODOLOGY 


2.1 Overview of Samples Used in Estimation 


To estimate the population, the PES used two samples, called the P (for Population) sample
and the E (for Enumeration) sample. The P sample is used to measure census omissions. The 
E sample is used to measure census erroneous enumerations. 
The P sample consists of a block sample with an independent listing of housing units
and personal interviews, whereas the E-sample data are the census enumerations (counts) from the
same sample blocks. The P sample obtained the data needed for matching and estimation, including
census day residence. A design decision defined who is included in the P sample:
the P sample was all persons living at the sampled address at the time of the PES interview.
The alternate procedure would interview the residents on census day. We decided against the
latter approach because all movers would involve proxy respondents (the interview would come from
nonhousehold members). Under the approach chosen, movers were living at the sample address and
could have completed interviews without resorting to a proxy respondent. However, all residents on
census day who moved outside the test site before the PES interview had zero probability of being
captured in the P sample. All P-sample persons who lived outside the test site on census day
were considered out-of-scope.

After interviewing, all P-sample persons were matched to the census. A computer matching 
program was used with clerical review. A second design decision defined the extent of search 
for matching. The PES classified a P-sample person as matched if the person was counted in 
the census anywhere in the test site. An alternate procedure would define a more limited search 
area, such as the PES block and neighboring blocks. Then a P-sample person is called a match 
only if the corresponding census person is within this search area. As an aside, the 1990 PES 
procedure will use a limited search area for matching. 

All unresolved cases from matching were sent to followup to obtain additional informa- 
tion for matching. The followup workload from the P sample was greatly reduced by asking 
all questions needed for matching at the time of the original interview. Therefore only 
incomplete personal characteristics, incomplete mover address, and uncertain match cases were 
sent to followup from the P sample. Nonmatched P-sample cases were considered resolved 
and not sent to followup. Many E-sample persons are matched to P-sample persons and are 
resolved without the need of another interview. All E-sample persons not resolved from the 
P-sample interview were sent out for a followup interview that is used to determine their 
enumeration status. Operational aspects are discussed in more detail in the following section. 
The types of census erroneous enumerations measured by the E sample included geocoding
error, duplication, fabrication, persons born after census day, persons who died before census
day and unmatchable cases. Geocoding error is defined as a census enumeration that exists
outside the search area (here, the entire test site). Unmatchable cases are census enumerations
without a name. Unmatchable cases cause an overestimate of the number of erroneous enumerations,
but are treated in the same manner as erroneous enumerations in the estimator.


2.2 Dual-System Estimation 


In order to estimate the total population, a dual-system estimator is used which combines 
the information from the P and E samples. Wolter (1986) describes different dual-system esti- 
mators and their underlying assumptions. The dual-system estimator used in TARO is written 


$$\mathrm{DSE} = \frac{N_p\,(\mathrm{CEN} - \mathrm{SUB} - \mathrm{EE})}{M}, \qquad (1)$$


where $N_p$ = estimator of the total PES population, CEN = unadjusted census count,
SUB = number of census whole-person substitutions, EE = estimator of the number of
erroneous enumerations and unmatchable persons included in the census, derived from the E
sample, and $M$ = estimator of the number of persons in both the census and the PES populations.
Census whole-person substitutions are defined as any person included in the census with fewer
than two demographic characteristics. In order to better understand and explain some of the
unique features of the dual-system estimator, Table 1 shows the classification of each person
in the population.


Table 1
Dual-System Classification

                          P-Sample Target Population
                          In          Out         Total
Census         In         $N_{11}$    $N_{12}$    $N_{1+}$
Enumeration    Out        $N_{21}$    $N_{22}$    $N_{2+}$
               Total      $N_{+1}$    $N_{+2}$    $N_{++}$

The population quantities in Table 1 are estimated by components of the dual-system
estimator: $\hat{N}_{11} = M$, $\hat{N}_{+1} = N_p$, and $\hat{N}_{1+} = \mathrm{CEN} - \mathrm{SUB} - \mathrm{EE}$.
The value of $N_{22}$ is unobservable
by definition but is estimated by assuming independence between the census enumeration and
the P sample of the PES. The estimate of $N_{22}$ is given by


$$\hat{N}_{22} = \hat{N}_{12}\,\hat{N}_{21} / \hat{N}_{11}. \qquad (2)$$


By using the estimators defined above, the estimate of the total population is given by
$\hat{N}_{++} = \mathrm{DSE}$.
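
The relationship between equations (1) and (2) can be checked numerically. The sketch below is a minimal illustration in Python; the function names and all input values are hypothetical, and the code assumes the cell definitions of Table 1 (matches in the In/In cell, corrected census count as the census row total, PES population as the P-sample column total).

```python
def dual_system_estimate(n_p, cen, sub, ee, m):
    """Equation (1): DSE = N_p * (CEN - SUB - EE) / M."""
    if m <= 0:
        raise ValueError("the estimated number of matches must be positive")
    return n_p * (cen - sub - ee) / m


def classification_cells(n_p, cen, sub, ee, m):
    """Estimated cells of the 2 x 2 dual-system table."""
    n11 = m                      # in the census and in the P sample
    n12 = (cen - sub - ee) - m   # in the census, missed by the P sample
    n21 = n_p - m                # in the P sample, missed by the census
    n22 = n12 * n21 / n11        # equation (2), under independence
    return n11, n12, n21, n22


# Hypothetical inputs, for illustration only.
n_p, cen, sub, ee, m = 1000.0, 950.0, 15.0, 17.0, 880.0
dse = dual_system_estimate(n_p, cen, sub, ee, m)
cells = classification_cells(n_p, cen, sub, ee, m)
print(dse, sum(cells))
```

Summing the four estimated cells reproduces the dual-system estimate, which is why estimating the unobservable cell by equation (2) is equivalent to using equation (1) directly.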

Because of problems in matching census data, special handling is needed to prevent an 
overestimate of the population. The dual-system estimator assumes every person is uniquely 
assigned to one cell in Table 1. So instead of just using the census count, the estimate of 
erroneous enumerations is subtracted from the census count to give an estimate of the number 
of unique persons counted in the census. Additionally, the dual-system estimator assumes each 
person can be called a match or a nonmatch. Census enumerations with insufficient informa- 
tion for matching (e.g., no name or fewer than two demographic characteristics) cannot be 
called matches or nonmatches with certainty. Therefore, unmatchable persons are also sub- 
tracted from the census count. All corresponding P-sample persons are called nonmatches and 
assigned to the $N_{21}$ cell.


2.3 Sample Design 


The sample design was a stratified sample with the sampling unit being a block. Two types 
of data were used to stratify the test site — a count of housing units by block obtained from 
the 1986 census address file and a mapping of 1980 census race data into the 1986 census 
geographic units. This mapping could only be made at the census tract level which equals one 
to six blocks. Therefore, the assignment of the racial grouping was done at the census tract 
level. All blocks within the census tract were assigned to the same racial category, and thus 
were in the same stratum. 

The test site was stratified into six sampling strata, described in Table 2. 

All blocks with special places (mostly group quarter population) were put into a separate
sampling stratum. These blocks were considered out-of-scope and were not sampled. Small
blocks were placed in a separate stratum to reduce the sampling variance. All blocks in census
tracts with at least 18% Asian defined the Asian stratum. All non-Asian blocks in census tracts
with at least 40% Hispanic defined the three Hispanic strata. All remaining blocks that were
not in the above strata defined the Other stratum.
Table 2
Sampling Strata and Allocation of Sampled Blocks

Sampling Strata                                  Number of Blocks Sampled
1. Hispanic Blocks with large multiunits 8 
2. Hispanic Blocks with small multiunits 49 
3. Hispanic Blocks with single units Bh 
4. Asian Blocks Shs 
5. Other Blocks 38 
6. Blocks with two or fewer housing units 21 


The 1986 housing count data also contained information on single unit and multiunit structures.
These data were used to split the Hispanic blocks into single unit, small multiunit, and
large multiunit strata. The Hispanic large multiunits stratum was defined as the Hispanic blocks
with 50% or more of the housing units in structures with 10 or more addresses. The Hispanic
single unit stratum was defined as the Hispanic blocks with more than 50% of the housing units
in single units. The Hispanic small multiunits stratum was defined as the remainder of the
Hispanic blocks.
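
The stratification rules above can be summarized as a short decision procedure. The Python sketch below is an illustration only: the `Block` fields and the stratum labels are hypothetical names, and the order in which the rules are checked (special places, then small blocks, then Asian, Hispanic, and Other) is an assumption, since the text does not state the precedence explicitly.

```python
from dataclasses import dataclass


@dataclass
class Block:
    housing_units: int
    has_special_place: bool
    tract_pct_asian: float        # tract-level percent Asian (1980 data)
    tract_pct_hispanic: float     # tract-level percent Hispanic (1980 data)
    pct_units_large_multi: float  # % of units in structures with 10+ addresses
    pct_units_single: float       # % of units in single-unit structures


def assign_stratum(block: Block) -> str:
    """Assign a block to a sampling stratum using the thresholds in the text."""
    if block.has_special_place:
        return "special places (out of scope)"
    if block.housing_units <= 2:
        return "small blocks"
    if block.tract_pct_asian >= 18:
        return "Asian"
    if block.tract_pct_hispanic >= 40:
        if block.pct_units_large_multi >= 50:
            return "Hispanic, large multiunits"
        if block.pct_units_single > 50:
            return "Hispanic, single units"
        return "Hispanic, small multiunits"
    return "Other"


# Hypothetical block: 60 units in a 70% Hispanic tract, mostly single units.
example = Block(60, False, 5.0, 70.0, 10.0, 80.0)
print(assign_stratum(example))
```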

Within each of the sampling strata, an equal probability systematic sample of blocks was 
chosen. The sample consisted of 190 blocks containing about 6000 housing units. Table 2 con- 
tains the breakdown of the sampled blocks by the sampling strata. Large blocks with 70 or 
more housing units were subsampled to reduce the interviewing workload. The subsampling 
consisted of splitting the block into clusters of 35 to 50 housing units, using address ranges 
or block faces. One cluster was randomly selected for P-sample interviewing. The E sample 
was defined as all persons the census counted in the same cluster. 
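
An equal probability systematic sample within a stratum can be sketched as follows. This is a generic illustration in Python, not the Census Bureau's production procedure: the function name, the fractional-interval method, and the stratum of 100 hypothetical block identifiers are all assumptions.

```python
import random


def systematic_sample(units, n, rng):
    """Select n units with equal probability using a random start
    and a fixed (possibly fractional) sampling interval."""
    if not 0 < n <= len(units):
        raise ValueError("n must be between 1 and the number of units")
    interval = len(units) / n
    start = rng.random() * interval
    return [units[int(start + k * interval)] for k in range(n)]


# Hypothetical stratum of 100 blocks, identified by the integers 0..99.
blocks = list(range(100))
sample = systematic_sample(blocks, 10, random.Random(1986))
print(sample)
```

Because the interval is at least one, no block can be selected twice, and every block has the same selection probability n / N.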


2.4 Poststratification 


The dual-system estimator is biased and the bias can be large if the undercount rates are 
significantly different for subgroups of the population (Wolter 1986). To control this bias, 
the test site was partitioned into groups (poststrata) thought to have similar undercount rates.
Dual-system estimates were then calculated within each poststratum. 

The poststrata were chosen by examining the test site composition and from analysis of the 
1980 PES data. The most important discriminating variable of the undercount was race. Three 
race-ethnic groups were used: Hispanic, Asian and Other. A separate poststratum for blacks 
was not possible since few blacks lived in the test site. Minority renter was an important 
explanatory variable in our previous research (Isaki et al. 1987). Therefore, tenure was also 
used in constructing the poststrata. Hispanics living in a block with fewer than 50% of the
population being Hispanic (called Non-Hispanic blocks) were thought to have a different
undercount rate from other Hispanics and were assigned to a separate poststratum. Table 3 shows
the seven race-tenure groups, which are crossed by age (0-14, 15-29, 30-44, 45-64, 65+) and
sex to give the 70 poststrata used in estimation.
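
The count of 70 poststrata follows from crossing the seven race-tenure groups with five age groups and two sexes. A minimal Python sketch, using the category labels from the text and Table 3 (the variable names are illustrative):

```python
from itertools import product

RACE_TENURE = [
    "Hispanic Renters in Hispanic Blocks",
    "Hispanic Owners in Hispanic Blocks",
    "Hispanics in Non-Hispanic Blocks",
    "Asian Renters",
    "Asian Owners",
    "Other Renters",
    "Other Owners",
]
AGE = ["0-14", "15-29", "30-44", "45-64", "65+"]
SEX = ["Male", "Female"]

# Each poststratum is one (race-tenure, age, sex) combination.
poststrata = list(product(RACE_TENURE, AGE, SEX))
print(len(poststrata))  # 7 * 5 * 2
```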

Table 3 also shows the sample sizes for the P sample and for the E sample. The lower sample 
size for the P sample than the E sample is partly explained by inmovers in the P sample which 
are treated as being out-of-scope. 
Table 3
Race-Tenure Categories Used in Poststratification, Including Sample Sizes

Race-Tenure Categories                   P Sample    E Sample
Hispanic Renters in Hispanic Blocks         8,182       8,739
Hispanic Owners in Hispanic Blocks          5,688       5,867
Hispanics in Non-Hispanic Blocks              896       1,005
Asian Renters                                 666         911
Asian Owners                                1,144       1,230
Other Renters                               1,135       1,316
Other Owners                                1,841       1,908
Total                                      19,552      20,976


2.5 Handling Missing Data 


To compute the dual-system estimates, a complete data file is needed. The 1986 test con- 
tained missing data, as is true for any sample survey. Schenker (1988) presents a description 
of the methods used to handle missing data, including some effects of different assumptions 
about missing data on the dual-system estimates. For completeness, we give a brief description
of the methods. 

Missing data occurred for person and household characteristics, the match status (matched/ 
nonmatched) for the P-sample persons, and enumeration status (correct/erroneous) for the 
E-sample persons. For P-sample noninterviews, a weighting adjustment was used. Missing 
characteristics were imputed using a ‘‘hot-deck’’ procedure. For match status, a logistic regres- 
sion model was used to estimate the probability of being matched. Rather than assign a defi- 
nite match or nonmatch status to each unresolved case, the estimated probabilities were used 
in the dual-system estimates. An analogous procedure was used for missing E-sample enumera- 
tion statuses. 
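
One way to read the use of estimated probabilities is that each unresolved case contributes fractionally to the match total. The Python sketch below illustrates that reading only; the text does not give the formula, so the function, the weights, and the probability value are hypothetical, and the logistic regression that would produce the probabilities is not shown.

```python
def estimated_matches(records):
    """Weighted match total in which resolved matches count as 1,
    resolved nonmatches as 0, and unresolved cases as their
    model-estimated match probability."""
    return sum(weight * prob for weight, prob in records)


# Hypothetical (sampling weight, match probability) pairs.
records = [
    (10.0, 1.0),  # resolved match
    (10.0, 0.0),  # resolved nonmatch
    (10.0, 0.5),  # unresolved case with estimated probability 0.5
]
m_hat = estimated_matches(records)
print(m_hat)
```

Using the probabilities directly avoids forcing each unresolved case into a definite match or nonmatch status.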


2.6 Small Area Estimation 


To make an adjustment additive at all levels of aggregation for users, the estimates of the 
undercount are carried down to the block level (the smallest geographical unit). But before 
carrying the undercount estimates to the block level, a regression model is used to ‘‘smooth’’
the effects of sampling error. Adjustment factors are used as the dependent variable in the 
regression model. An adjustment factor is defined as the dual-system estimator divided by the 
census count: 


Y = DSE/CEN, (3) 


where CEN and DSE were defined previously. 


The regression model is written as 


$$Y_i = B_0 + B_1 X_{i1} + \cdots + B_p X_{ip} + S_i + E_i, \qquad (4)$$

where $Y_i$ = adjustment factor for the $i$-th poststratum ($i = 1, \ldots, 70$), $X_{ij}$ = independent
variable ($j = 1, \ldots, p$), $B_j$ = regression coefficient to be estimated, $S_i$ = sampling error of
the adjustment factor, $E_i$ = model error, and the $S_i$ and $E_i$ are independent and normally
distributed with mean 0 and variances equal to $\sigma_i^2$ and $e^2$ respectively. The $e^2$ and $B_j$'s are
estimated using maximum likelihood methods (Ericksen and Kadane 1985). The $\sigma_i^2$ are
estimated directly from the sample. The sample-based adjustment factor and the model-based
adjustment factor are averaged together to form the predicted adjustment factor


$$AF_i = \left( Y_i/\sigma_i^2 + X_i B/e^2 \right) \left( 1/\sigma_i^2 + 1/e^2 \right)^{-1}, \qquad (5)$$


which is used to adjust the census block data. The variance of $AF_i$ can be obtained from the
results in Freedman and Navidi (1986).
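
The averaging step can be illustrated with a short Python sketch. It assumes that the predicted adjustment factor is the precision-weighted average of the sample-based factor and the regression prediction, with weights equal to the inverse sampling variance and the inverse model variance. The function name and all numeric values are hypothetical, and fitting the regression itself is not shown.

```python
def predicted_adjustment_factor(y, model_pred, sampling_var, model_var):
    """Precision-weighted average of the sample-based adjustment factor y
    and the model-based prediction model_pred."""
    w_sample = 1.0 / sampling_var   # weight on the direct estimate
    w_model = 1.0 / model_var       # weight on the regression prediction
    return (y * w_sample + model_pred * w_model) / (w_sample + w_model)


# Hypothetical poststratum: direct factor 1.10, regression prediction 1.06.
equal = predicted_adjustment_factor(1.10, 1.06, 0.0004, 0.0004)
noisy = predicted_adjustment_factor(1.10, 1.06, 0.0040, 0.0004)
print(equal, noisy)
```

With equal variances the result is the simple average of the two factors. When the sampling variance is large relative to the model variance, the result moves toward the regression prediction, which is the shrinkage behaviour the smoothing is meant to provide.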

Synthetic estimation was used to carry down the adjustment from each poststratum to the 
census block. The synthetic estimator is written as 


$$ADJ_{ij} = AF_i \times CEN_{ij}, \qquad (6)$$


where $i$ and $j$ denote the poststratum and block respectively and ADJ is the adjusted population
at the block level.
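
The synthetic estimator applies each poststratum's adjustment factor to that poststratum's census count within every block. A minimal Python sketch under that reading follows; the dictionary layout, the labels, and the numbers are hypothetical.

```python
def synthetic_block_estimates(adjustment_factors, block_counts):
    """Multiply each block's census count in a poststratum by the
    adjustment factor of that poststratum.

    adjustment_factors: {poststratum: AF}
    block_counts:       {block: {poststratum: census count}}
    """
    return {
        block: {ps: adjustment_factors[ps] * count for ps, count in counts.items()}
        for block, counts in block_counts.items()
    }


# Hypothetical factors for two poststrata and census counts for two blocks.
factors = {"A": 1.10, "B": 1.05}
census = {"block 1": {"A": 20, "B": 40}, "block 2": {"A": 10, "B": 0}}
adjusted = synthetic_block_estimates(factors, census)
print(adjusted)
```

Because the same factor is applied in every block, block-level adjusted counts for a poststratum sum to the factor times the poststratum's total census count, which keeps the adjustment additive across levels of aggregation.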

The adjusted block population, $ADJ_{ij}$, is usually a noninteger number. The census counts
whole persons. In order to incorporate the adjustment into the census, the noninteger values
must be transformed into integers. Integerization (or controlled rounding) rounds all values
to the integer part of the number or to the integer part of the number plus one (Causey et al.
1985).
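
The rounding constraint can be illustrated with a simple largest-remainder scheme. This Python sketch is a stand-in and is not the controlled rounding algorithm of Causey et al. (1985): it handles a single list of values, rounds each to its integer part or integer part plus one, and preserves the rounded total. The function name and the inputs are hypothetical.

```python
import math


def integerize(values):
    """Round each nonnegative value down or up by at most one so that
    the results sum to the rounded total of the inputs."""
    floors = [math.floor(v) for v in values]
    remainders = [v - f for v, f in zip(values, floors)]
    shortfall = round(sum(values)) - sum(floors)
    # Round up the values with the largest fractional parts first.
    order = sorted(range(len(values)), key=lambda i: remainders[i], reverse=True)
    result = floors[:]
    for i in order[:shortfall]:
        result[i] += 1
    return result


# Hypothetical adjusted counts for three cells of one block.
rounded = integerize([2.3, 4.6, 1.1])
print(rounded)
```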

After integerization of the adjusted block estimates, counts were produced for the number
of persons by age-race-sex to be added to or subtracted from each block. In the case of
undercounts, a census enumeration having the same range of characteristics as the estimated missed
person was randomly selected from within the block and copied into a new census record. A
nonhousehold category was used to add persons to the census so that household relationships
and creation of new households were not needed. Zaslavsky (1988) describes an alternate
procedure, using weighting, for adding persons and households to census blocks. In the case
of overcounts, census persons with the required characteristics would be flagged and would
not be counted in the adjusted census tabulations.


3. OPERATIONS AND TIMING 


The major focus of this test census was to study the timing and operational aspects of 
adjusting the census. Previous PES’s at the Census Bureau have taken about two years or longer 
to complete. For example, the 1980 PES produced undercount estimates in the fall of 1981 
and a final set of estimates in early 1982. 

Table 4 presents the major census and PES operations and their start and end dates. Gaps 
exist in Table 4 because not all census and PES operations are listed. Some PES operations
have overlapping time schedules since these operations were occurring at the same time. PES 
activities started after all major census field activities were completed. This helps ensure inde- 
pendence between the census and the PES by having the field staffs working at different times. 
Table 4
1986 TARO Operational Schedule

Operation Start End
Census Day March 16 March 16 
Nonresponse Followup April 09 May 08 
Key Census Names May 23 June 10 
Census File for Matching Aug. 08 Aug. 15 
PES Address Listing June 17 June 21 
PES Subsampling June 25 July 01 
PES Interviewing June 25 Aug. 08 
Key PES form July 21 Aug. 19 
Computer Match Aug. 28 Sept. 09 
Extended Computer Match Sept. 09 Oct. 03 
Clerical Match Sept. 15 Oct.
Field Followup Sept. 23 Nov. 06
Followup Matching Sept. 29 Nov. 06
Key Match Results Oct. 11 Nov. 10
Prepare P- and E- sample files Nov. 11 Jan. 02 
Imputations Jan. 05 Jan. 11 
Final Census file - Jan. 05 
Estimate Poststrata Jan. 12 Feb. 11 
Small Area Estimates Feb. 12 Feb. 22


The census was conducted by mailing a questionnaire to every known housing unit and asking
a household member to complete the form on Census Day (March 16). For each household that
failed to mail back its questionnaire, the form was completed in person by an enumerator. This is
called nonresponse followup. Completed forms were sent to the processing office, where the
data, which for this test included all census names, were entered into the computer.

The first step of the PES produced an independent listing of all addresses in the sample 
blocks. The listings were compared to an administrative list to ensure accuracy and complete- 
ness. This quality control check showed that 127 (67%) blocks had no change to the address 
listing. The remaining 63 (33%) blocks had changes made from the quality control check and 
were relisted. The relisting added addresses to 37 blocks, corrected addresses in 39 blocks, and 
deleted addresses in 9 blocks. (Since multiple changes were made for some blocks, the above 
numbers do not add up to the total number of relisted blocks.) The changes in the address listings 
from the quality control check showed only minor corrections. After passing the quality control 
check, all blocks of 70 or more housing units were subsampled using block faces or address 
ranges. 

The PES interview was conducted by personal visits. Questions were asked of all current 
residents to obtain their demographic characteristics. Special questions asked about residence 
on Census Day, mailing address, alternate addresses such as college residence, and other persons 
who may have lived at this residence on Census Day. A quality control check of the PES ques- 
tionnaire verified the roster of names. For the sample of forms checked, 96% passed the quality 
control operation. The 4% that failed the quality control check were reinterviewed and corrected. 

The final outcome of the interviewing showed that 5,714 (93.2%) of the housing units had 
a completed interview with a household member. Another 193 (3.1%) housing units were vacant 
and 189 (3.1%) housing units had a completed interview with a non-household member (e.g. 
neighbor). Only 32 (0.5%) housing units were coded as noninterviews. The extremely low 
noninterview rate is attributable to the 5 week interviewing period. 
As the PES questionnaires were completed they were prepared for computer matching to
the census file. The computer matching was split into two parts: first, matching the PES data 
with the E-sample data and second, called extended computer matching, matching all P-sample 
cases that did not match in the first part of the computer matching to the remaining census 
data. The extended computer match was used to match movers between Census Day and the 
time of the PES interview and geographical coding errors, i.e., where the housing unit is assigned 
to the wrong block. The first part of the computer matching assigned a match to 14,700 (73.5%) 
of the P-sample cases and assigned a possible match to another 2,550 (12.0%). The extended 
computer matching assigned a match to another 130 persons (0.7%) and assigned a possible 
match to another 570 persons (2.9%). Because the extended computer matching assigned a 
match status to only a small percentage of P-sample cases, we concluded that the geographical 
coding in Los Angeles had few errors. 

Clerical matching reviewed the results of the computer matching. Clerical matching also 
identified the cases with insufficient data for matching (for which imputation is necessary). 
Clerical matching prepared followup forms for unresolved P-sample and E-sample cases. 

Field followup consisted of 1,551 housing units with 1,511 (97.4%) being recorded as com- 
pleted interviews. The field followup was followed by final matching. The final P-sample results 
show that 17,018 (85.2%) persons were matched to a census person and 2,373 (11.9%) persons
were not matched. Another 426 (2.1%) persons were considered out-of-scope (mostly persons 
who lived outside the test site on Census Day) and 161 (0.8%) persons were unresolved (and 
later had match status imputed). The final E-sample results show that 19,637 (93.6%) persons 
were correctly enumerated and 360 (1.7%) were erroneously enumerated in the census. Another 
976 (4.7%) persons were unresolved and had an enumeration status imputed. 

All missing data after final matching including match status for the P sample and enumera- 
tion status for the E sample were imputed. The results were used to create the dual-system 
estimates. The estimates were smoothed and carried down to the block level to create an adjusted 
census file. The improvements in timing to produce the undercount estimates were mainly due 
to the matching activities. The computer and clerical matching for TARO took about 3 months, 
while the 1980 PEP matching activities took over one year to complete. Additional time savings 
were due to improved planning of operations and better access to census materials.


4. ESTIMATES 


4.1 Poststrata Estimates 


This section presents the undercount estimates for various aggregations of the poststrata.
Table 5 presents the percent undercount, 100(1 - CEN/DSE); percent nonmatched, 100(1 - $M/N_p$);
percent erroneously enumerated, 100(EE/CEN); and percent substituted, 100(SUB/CEN).

A feature of the dual-system estimator is that the estimates summed over several categories
do not equal the direct estimate of the summed categories. To keep the estimates reported
in Table 5 consistent, all estimates are summed over the other relevant categories.
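
The measures reported in Table 5, and the non-additivity of the dual-system estimator, can be illustrated numerically. The Python sketch below uses hypothetical inputs throughout, and the helper names are illustrative. The second argument of `dse` is the corrected census count, CEN - SUB - EE.

```python
def dse(n_p, corrected_census, m):
    """Dual-system estimate from the PES population estimate, the
    corrected census count (CEN - SUB - EE), and the match estimate."""
    return n_p * corrected_census / m


def percent_undercount(cen, dse_value):
    return 100.0 * (1.0 - cen / dse_value)


def percent_nonmatched(m, n_p):
    return 100.0 * (1.0 - m / n_p)


# Two hypothetical poststrata: (N_p, CEN - SUB - EE, M).
strata = [(100.0, 90.0, 80.0), (100.0, 95.0, 93.0)]

# Sum of the poststratum estimates versus one estimate from pooled components.
sum_of_dse = sum(dse(*s) for s in strata)
pooled_dse = dse(*(sum(column) for column in zip(*strata)))
print(sum_of_dse, pooled_dse)
```

The two totals differ because the estimator is a ratio, so summing the poststratum estimates is not the same as applying the estimator to summed components. This is why the aggregates in Table 5 are built by summing poststratum estimates.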

Examining Table 5 for percent undercount by the race-tenure groups, one concludes: tenure 
is a good stratification variable with higher undercount estimates for renters than for owners; 
race/ethnicity also differentiates the undercount with higher undercount estimates for Hispanics 
than for Asians, which in turn are higher than for Others. Percent erroneously enumerated
is higher for renters than for owners, but differs little among the race/ethnicity
groups. Percent substituted is higher for Hispanic and Other renters than for Hispanic and
Other owners. Asian owners had a higher percent substituted than Asian renters, the reverse
of the pattern in the other two race/ethnicity groups.
Table 5
Percent Undercount and the Components of the Dual-System Estimates for the Poststrata


                   Percent        Percent Nonmatched   Percent Erroneously    Percent Substituted
Poststrata         Undercount a   of the P-Sample      Enumerated of the      of the Census
                                                       E-Sample b
Hispanic Renters in 

Hispanic Blocks roy Lek 2.6 1.7 
Hispanic Owners in 

Hispanic Blocks 6) 8.1 lee | ies 
Hispanics in Non- 

Hispanic Blocks fi) 9.7 1.4 js 
Asian Renters ih ll 13.4 Del ee 
Asian Owners 4.6 6.8 ee, 5 
Other Renters 9.9 12.9 DRA he7/ 
Other Owners 3.8 5.8 eS 0.9 

0-14 8.8 11.9 XD) 1.6 
15-29 13.6 16.2 | 1.6 
30-44 8.6 10.8 1.4 1.4 
45-64 4.5 6.6 ib3" 1.4 
65+ 358) 5.9 le eS 
Male 9.7 12.1 15 7/ ibaS) 
Female 8.3 10.8 1.9 ibs 
Total 9.0 11.4 1.8 1S) 


@ All estimates are summed over all other catetgories. 
Erroneously enumerated includes unmatchable nonsubstituted. 


Examining Table 5 for percent undercount by age and sex, one observes that the age group 
15-29 had the highest undercount and that males had a higher undercount than females. The age 
groups 0-14 and 30-44 have similar undercount estimates, slightly below the average for the test 
site. The age groups 45-64 and 65+ also have similar undercount estimates, well below those of the 
other age groups. These results are fairly consistent in distribution with previous undercount 
results. 

Percent erroneously enumerated is highest for the two youngest age groups, 0-14 and 15-29. 
The age groups 30-44 and 45-64 have similarly low estimates of percent erroneously enumerated. 
Surprisingly, the percent erroneously enumerated for the age group 65+ falls in the middle. Only small 
differences are observed in percent erroneously enumerated between the sex groups, and only small 
differences are observed in percent substituted across the age groups or the sex groups. 


4.2 Small Area Estimates 


Before applying the adjustment at the block level, as mentioned earlier, a regression model 
was fitted to "smooth" the data and reduce the effects of sampling variability. The regression 
model was fitted to the 70 adjustment factors defined by the poststrata. The regression modelling 
is used to find a common pattern of undercounting in the data; the sample-estimated 
adjustment factors are then shrunk toward this common pattern. This is similar in spirit to the 
James-Stein estimator and to empirical Bayes estimators. The independent variables available 
for use in the model were indicator variables for the race-tenure groups, for the age groups, 
and for the sex groups. No interaction terms were allowed to enter the model. The model that 
fit the data and had significant coefficients (under an unweighted regression model) was the 
following: 


Y = 1.038 + .090(HR) + .044(AR) + .013(OR) + .058(A15-29) - .009(A45-64) 

where Y = model-based adjustment factor 

HR = 1 if Hispanic Renter in Hispanic Blocks 
   = 0 otherwise 

AR = 1 if Asian Renter in all Blocks 
   = 0 otherwise 

OR = 1 if Other Renter in all Blocks 
   = 0 otherwise 

A15-29 = 1 if age group 15-29 
       = 0 otherwise 

A45-64 = 1 if age group 45-64 
       = 0 otherwise. 
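
As a reading aid, the fitted equation can be evaluated directly. The following is a minimal sketch: the coefficients are those printed above, while the function name and the short group codes are hypothetical. The predicted factors reported for the poststrata also reflect the shrinkage of the sample estimates toward this pattern, so they need not equal the fitted values computed here.

```python
# Coefficients copied from the fitted equation in the text.
INTERCEPT = 1.038
GROUP_EFFECT = {"HR": 0.090, "AR": 0.044, "OR": 0.013}   # renter groups only
AGE_EFFECT = {"15-29": 0.058, "45-64": -0.009}


def model_adjustment_factor(group, age):
    """Model-based adjustment factor Y for a race-tenure group and age group.

    `group` uses hypothetical short codes: "HR" = Hispanic renter in
    Hispanic blocks, "AR" = Asian renter, "OR" = Other renter. Any other
    code (owners, Hispanics in non-Hispanic blocks) gets no group effect.
    Sex does not enter because it was not retained in the model.
    """
    return INTERCEPT + GROUP_EFFECT.get(group, 0.0) + AGE_EFFECT.get(age, 0.0)


y_hr_young = model_adjustment_factor("HR", "15-29")    # renter, high-undercount age
y_owner_child = model_adjustment_factor("AO", "0-14")  # baseline cell
```

The baseline cell (an owner group in an age group without its own indicator) receives the intercept alone, and the renter and age effects add to it.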


The regression model shows larger undercount estimates for renters than for owners. 
The age group 15-29 also has much higher undercount estimates than the other age groups, and the 
age group 45-64 has lower undercount estimates than the other age groups. The variable sex 
was statistically insignificant and was not included in the model. Two adjustment factors, 
those for Hispanics in Non-Hispanic blocks, male, 65+ and for Asian renters, male, 65+, had a zero estimated 
variance and were not included in the model. For these two poststrata, the predicted adjustment 
factor was defined as the sample-estimated adjustment factor. 

Table 6 contains the sample-estimated and predicted adjustment factors for the 70 poststrata. 
In general, the smoothing lowers the highest estimated adjustment factors 
and raises the lowest estimated adjustment factors. The predicted adjustment factors have less 
variability than the sample-estimated adjustment factors. The most notable example of the 
effects of the regression model is for Asian renters, female, age 65+. The predicted adjustment 
factor is 1.087 rather than the sample-estimated adjustment factor of 1.212. This predicted 
adjustment factor is closer to the expectation of a lower undercount for the age group 65+ 
than for the other age groups. 

The predicted adjustment factors were multiplied by the census counts for the 2,405 blocks 
in the test site. The adjusted census counts were rounded to form integer values. Although three 
predicted adjustment factors were less than one (an estimated overcount), the integerization 
process did not produce any adjusted overcounts. 

The adjustment process added 32,843 people to the census, an 8.2% undercount rate. If the 
sample-estimated adjustment factors had been used, 36,454 people would have been added 
to the census, a 9.0% undercount rate. The process of smoothing lowered the undercount 
estimate by almost 10%. This occurred because the largest undercount estimates were lowered 
by the smoothing and these same groups had the largest population counts. 


Table 6 
Results of Smoothing TARO Adjustment Factors 

                             Estimated               Predicted
Poststrata     Sex  Age      Adj.          Std.      Adj.          Std.
                             Factor Y      Error     Factor AF     Error

HR in HB       M    0-14     [illegible]   0.020     1.130         0.016
HR in HB       M    15-29    1.247         0.030     [illegible]   0.021
HR in HB       M    30-44    1.165         0.029     1.144         0.020
HR in HB       M    45-64    1.099         0.043     1.114         0.024
HR in HB       M    65+      1.055         0.044     1.110         0.023
HR in HB       F    0-14     1.124         0.023     1.126         0.018
HR in HB       F    15-29    1.234         0.032     1.203         0.022
HR in HB       F    30-44    1.084         0.017     1.098         0.015
HR in HB       F    45-64    [illegible]   0.040     1.124         0.024
HR in HB       F    65+      1.099         0.045     1.122         0.024
HO in HB       M    0-14     1.056         0.018     1.050         0.015
HO in HB       M    15-29    1.078         0.018     1.084         0.015
HO in HB       M    30-44    1.087         0.016     1.072         0.014
HO in HB       M    45-64    1.031         0.012     1.031         0.011
HO in HB       M    65+      1.073         0.028     1.054         0.019
HO in HB       F    0-14     1.059         0.020     [illegible]   0.016
HO in HB       F    15-29    1.088         0.016     1.090         0.014
HO in HB       F    30-44    1.033         0.012     1.034         0.011
HO in HB       F    45-64    1.020         0.012     1.022         0.011
HO in HB       F    65+      1.033         0.019     1.035         0.015
H in H’B       M    0-14     1.105         0.052     1.051         0.023
H in H’B       M    15-29    1.154         0.054     1.106         0.025
H in H’B       M    30-44    1.131         0.065     1.050         0.024
H in H’B       M    45-64    1.063         0.050     1.036         0.023
H in H’B       M    65+      0.991         0.000     0.991         0.000
H in H’B       F    0-14     [illegible]   0.047     1.059         0.023
H in H’B       F    15-29    1.033         0.022     1.060         0.017
H in H’B       F    30-44    1.079         0.037     1.051         0.021
H in H’B       F    45-64    1.033         0.028     1.031         0.019
H in H’B       F    65+      0.947         0.040     1.013         0.022
AR in all B    M    0-14     1.059         0.041     1.076         0.026
AR in all B    M    15-29    [illegible]   0.044     [illegible]   0.028
AR in all B    M    30-44    [illegible]   0.077     1.093         0.031
AR in all B    M    45-64    1.004         0.057     1.063         0.030
AR in all B    M    65+      0.982         0.000     0.982         0.000
AR in all B    F    0-14     1.067         0.047     1.079         0.028
AR in all B    F    15-29    [illegible]   0.055     [illegible]   0.029
AR in all B    F    30-44    [illegible]   0.105     1.087         0.032
AR in all B    F    45-64    1.012         0.061     1.065         0.030
AR in all B    F    65+      1.212         0.127     1.087         0.032
AO in all B    M    0-14     1.045         0.030     1.041         0.019
AO in all B    M    15-29    1.059         0.038     1.085         0.022
AO in all B    M    30-44    1.091         0.040     1.053         0.022
AO in all B    M    45-64    1.035         0.020     1.033         0.016
AO in all B    M    65+      1.031         0.051     1.037         0.023
AO in all B    F    0-14     1.040         0.041     1.039         0.022
AO in all B    F    15-29    1.052         0.046     1.086         0.024
AO in all B    F    30-44    1.035         0.036     1.037         0.021
AO in all B    F    45-64    1.038         0.019     1.035         0.015
AO in all B    F    65+      1.051         0.045     1.041         0.022
O’R in all B   M    0-14     1.037         0.059     1.049         0.027
O’R in all B   M    15-29    [illegible]   0.114     [illegible]   0.031
O’R in all B   M    30-44    1.144         0.066     1.062         0.028
O’R in all B   M    45-64    1.055         0.031     1.047         0.022
O’R in all B   M    65+      1.068         0.056     1.054         0.027
O’R in all B   F    0-14     1.148         0.062     1.064         0.027
O’R in all B   F    15-29    1.126         0.054     [illegible]   0.028
O’R in all B   F    30-44    1.134         0.057     1.064         0.027
O’R in all B   F    45-64    1.068         0.041     1.049         0.025
O’R in all B   F    65+      0.948         0.021     0.992         0.018
O’O in all B   M    0-14     1.044         0.037     1.040         0.021
O’O in all B   M    15-29    1.148         0.064     1.103         0.025
O’O in all B   M    30-44    1.006         0.048     1.032         0.023
O’O in all B   M    45-64    1.036         0.017     1.034         0.014
O’O in all B   M    65+      1.017         0.019     1.025         0.016
O’O in all B   F    0-14     1.159         0.068     [illegible]   0.024
O’O in all B   F    15-29    1.081         0.042     1.092         0.023
O’O in all B   F    30-44    0.997         0.017     1.011         0.014
O’O in all B   F    45-64    1.025         0.012     1.026         0.011
O’O in all B   F    65+      0.997         0.012     1.004         0.011


Note: H: Hispanic, R: Renter, B: Block, M: Male, F: Female, O: Owner, O’: Other, H’: Non-Hispanic, A: Asian. 
(Example: HR in HB: Hispanic Renter in Hispanic Block) 
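
The totals quoted for the adjustment process can be checked with a few lines of arithmetic. This is only a consistency check under one assumption: the undercount rate is taken to be persons added divided by the adjusted count (census plus persons added), and the census count is backed out of the rounded 9.0% figure, so it is approximate.

```python
added_smoothed = 32_843   # persons added using the predicted (smoothed) factors
added_sample = 36_454     # persons the sample-estimated factors would have added

# Relative reduction in persons added that is attributable to smoothing.
relative_reduction = (added_sample - added_smoothed) / added_sample

# Census count implied by the 9.0% rate, assuming rate = added / (census + added).
implied_census = added_sample / 0.090 - added_sample

# Undercount rate under smoothing, computed from the implied census count.
rate_smoothed = added_smoothed / (implied_census + added_smoothed)
```

The relative reduction comes to about 9.9%, in line with the statement that smoothing lowered the undercount estimate by almost 10%, and the implied smoothed rate rounds to 8.2%.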


To summarize the block-level adjustments, Figure 1 shows the number of persons added 
by the number of blocks and Figure 2 shows the percent of persons added by the number of 
blocks. Almost 80% of the blocks added fewer than 20 persons. Only 2 blocks added more than 
150 persons; those 2 blocks were fairly large, containing about 2,000 people each. Over 80% 
of the blocks had undercount estimates ranging from 4% to 12%. Many of the small blocks 
added a small percent of persons because the estimates were rounded down, which makes a large 
change in the percent for a small block. The blocks with the largest percent added were largely 
Hispanic and renter blocks, the groups with the largest predicted adjustment factors. 


5. CONCLUSION 


This paper discusses the methodology, operations, and the results of the Test of Adjust- 
ment Related Operations. TARO tested the operational and timing aspects of adjusting the 
census for estimated persons missed in the enumeration of the population. 

The results from TARO demonstrate that undercount estimates can be produced in a timely 
manner. TARO was completed earlier than any previous PES. 

TARO measured an undercount of 9% for the Central Los Angeles County test census. 
Separate dual-system estimates are presented for 70 race-tenure by age by sex categories. The 
dual-system estimates were smoothed by fitting a regression model to the estimates and then 
the resulting estimates were carried down to the block level. The use of block level undercount 
estimates allows aggregation to any level above the block. 

Evaluations of the operations and of the assumptions underlying the estimators are given in 
Schenker (1988) and Hogan and Wolter (1988). Together with this paper, they provide a thorough 
evaluation of the census counts and the undercount estimates of the test census. 


Handling Missing Data in Coverage Estimation, with Application 
to the 1986 Test of Adjustment Related Operations 


NATHANIEL SCHENKER¹ 


ABSTRACT 


This paper discusses methods used to handle missing data in post-enumeration surveys for estimating 
census coverage error, as illustrated for the 1986 Test of Adjustment Related Operations (Diffendal 1988). 
The methods include imputation schemes based on hot-deck and logistic regression models as well as 
weighting adjustments. The sensitivity of undercount estimates from the 1986 test to variations in the 
imputation models is also explored. 


KEY WORDS: Imputation; Nonresponse; Post-enumeration survey; Weighting adjustments; Undercount. 


1. INTRODUCTION 


Missing data can be a major source of uncertainty in the estimation of coverage error for 
the decennial censuses in the United States (Freedman and Navidi 1986; Fay, Passel, and 
Robinson 1988, Chapter 6). For both the 1960 and 1980 Decennial Censuses, several estimates 
of coverage error were computed under different treatments of the missing data. 

The Bureau of the Census has conducted many tests of methods for coverage error estima- 
tion to prepare to handle missing data and other problems for the 1990 Decennial Census. One 
such test was the 1986 Test of Adjustment Related Operations (TARO) (Diffendal 1988), which 
used the 1986 Census of Central Los Angeles County. Changes in field methodology and design 
for TARO reduced the levels of certain types of missing data from the levels for 1980 (Hogan 
and Wolter 1988). Nevertheless, some missing-data problems remained. 

This paper describes the missing-data problems in TARO and how they were handled in 
the estimation process. Section 2 gives a brief description of how coverage error was estimated 
in TARO. Sections 3-6 discuss the types of missing data that occurred, the extent to which they 
occurred, and the methods used to handle them. These methods include a weighting adjust- 
ment for unit nonresponse (noninterviews), hot-deck imputation for missing demographic and 
housing characteristics, and imputation using logistic regression models for certain binary items 
related to enumeration in the census. Section 7 presents coverage error estimates under alter- 
native imputation models and alternative treatments of certain problem cases. The lowest and 
highest estimated undercount rates obtained using these alternatives are 8.50% and 10.16% 
for Hispanics, 5.86% and 7.81% for Asian non-Hispanics, and 5.81% and 6.59% for Others. 
The estimates from TARO for the three race categories were 9.85%, 7.32%, and 6.21%, respec- 
tively. A concluding discussion is given in Section 8. 


2. ESTIMATING CENSUS COVERAGE ERROR 


Diffendal (1988) discusses in detail how census coverage error was estimated in TARO. This 
section describes briefly those aspects necessary for understanding the rest of this paper. 


¹ Nathaniel Schenker, Undercount Research Staff, Statistical Research Division, Bureau of the Census, Washington, 
DC 20233, USA. This paper reports research undertaken by a member of the Census Bureau’s staff. The views 
expressed are attributable to the author and do not necessarily reflect those of the Census Bureau. 

Coverage error was estimated using data from a post-enumeration survey (PES) of people in 
the census site. First a sample of blocks in the site was drawn. Then each housing unit in the 
sample blocks was surveyed to determine its occupants on Census Day, its occupants at the 
time of the PES and where they lived on Census Day, and the characteristics of the occupants. 

Two samples were used to estimate census coverage error. The P (population) sample was 
composed of the people who lived in the PES sample blocks at the time of the PES. An attempt 
was made to match each P-sample person to a person enumerated in the census to determine 
whether the P-sample person had been enumerated; the match rate within each domain of study 
was used essentially to estimate the capture rate of the census for that domain. The E (enumera- 
tion) sample was composed of the people who were enumerated in the census as living in the 
PES sample blocks; this sample was used to estimate the number of erroneous enumerations 
(e.g., fictitious enumerations and duplicates) and unmatchable persons (e.g., persons for whom 
no names were reported) in the census within each domain. An attempt was made to match 
each E-sample person to a person in the PES. Each E-sample match was considered a correct 
enumeration since the PES indicated that the person should have been enumerated. Each E- 
sample nonmatch was followed up to determine whether it was an erroneous enumeration or 
a correct enumeration that was missed in the PES (which is not itself assumed to have perfect 
coverage). 

If a PES of the entire United States were conducted, individuals in the P-sample who moved 
out of Central Los Angeles County between Census Day and the PES would be interviewed 
in the PES. An attempt to match these individuals to census enumerations in Central Los 
Angeles County would be made, and the resulting data would be used in the estimation of cov- 
erage error for Central Los Angeles County. Similarly, individuals in the P-sample who moved 
into Central Los Angeles County between Census Day and the PES would contribute to cov- 
erage error estimates outside of Central Los Angeles County. However, because the census 
and PES for TARO were conducted only in Central Los Angeles County and not in the entire 
United States, outmovers from the test site were not interviewed in the PES and inmovers did 
not apply to the test. Thus data for inmovers and outmovers were not used in the estimation. 
(Note, however, that data for movers within the test site were used.) This issue is discussed further 
in Section 7.2. 

The "dual-system" estimator of the population size (see Marks, Seltzer, and Krotki 1974, 
Krotki 1978, and Wolter 1986 for discussion and references) is written 


DSE = N_p (CEN - SUB - EE)/M,     (1) 


where N_p is the weighted number of people in the P-sample, CEN is the unadjusted census 
count, SUB is the number of whole-person substitutions (for unit nonresponse) in the census, 
EE is a weighted estimate of the number of erroneous enumerations and unmatchable persons 
in the census, and M is the weighted number of matches between the P-sample and census; 
census data provide CEN and SUB, whereas P- and E-sample data provide N_p, EE, and M. 
The dual-system estimator can be thought of as inflating the estimated number of correct and 
matchable census enumerations (CEN - SUB - EE) by the inverse of the estimated census 
capture rate (M/N_p). 

The theory of dual-system estimation assumes that for both the census and the PES, the 
probability of capture is constant across all people in the domain to which the estimator is 
applied (Wolter 1986). Thus no one group of people in the domain should be more or less likely 
to be enumerated in the census or PES than any other group. To make this assumption more 
realistic in TARO, separate dual-system estimates were computed within poststrata based on 
person and housing characteristics. The poststrata are described in Diffendal (1988). One 
example is the Hispanic male renters of ages 30 to 44 living in primarily Hispanic blocks. 
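
The role of poststratification can be illustrated numerically. In the sketch below, the two poststrata and all of their totals are hypothetical; the only thing taken from the text is the form of estimator (1). The example shows that applying the estimator within poststrata and summing need not give the same answer as applying it once to pooled totals when capture rates differ.

```python
def dse(n_p, m, cen, sub, ee):
    """Dual-system estimate (1): correct, matchable enumerations / capture rate."""
    return n_p * (cen - sub - ee) / m


# Two hypothetical poststrata with different estimated capture rates M / N_p.
strata = [
    dict(n_p=9000.0, m=8550.0, cen=9000.0, sub=50.0, ee=100.0),   # 95% capture
    dict(n_p=4500.0, m=3600.0, cen=4000.0, sub=80.0, ee=120.0),   # 80% capture
]

# Poststratified estimate: apply the estimator within each stratum, then sum.
dse_by_stratum = sum(dse(**s) for s in strata)

# Pooled estimate: sum the components first, then apply the estimator once.
pooled = {key: sum(s[key] for s in strata) for key in strata[0]}
dse_pooled = dse(**pooled)

# Undercount rates for the combined domain under the two approaches.
cen_total = pooled["cen"]
undercount_by_stratum = 100 * (1 - cen_total / dse_by_stratum)
undercount_pooled = 100 * (1 - cen_total / dse_pooled)
```

Here the poststratified total is about 14,065.8 and the pooled total about 14,055.6, so the estimated undercount rate for the combined domain depends on the level at which the estimator is applied.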

To summarize, the P- and E-sample data needed for coverage error estimation were the 
match status (match vs. nonmatch) for each P-sample person, the enumeration status (correct 
vs. erroneous) for each E-sample person, and person and housing characteristics for each person 
in both samples. 


3. P-SAMPLE HOUSEHOLD NONINTERVIEWS 


Occasionally, a PES interviewer was unable to obtain an interview for an occupied housing 
unit; this occurred, for example, when the occupants refused to respond. Of the 5,935 housing 
units that were judged to be nonvacant, 32 (0.5%) were classified as having household noninter- 
views. The occurrence of household noninterviews resulted in missing data on the number of 
people in each household, person and housing characteristics, and match statuses. 

The block-sample design of the PES afforded a simple way to handle P-sample household 
noninterviews. Within each sample block, the sampling weights of the noninterview households 
were redistributed across the interviewed households. The noninterview weighting adjustment 
basically assumes that the distributions of people, characteristics, and match statuses for 
households not interviewed within a block are the same as for households interviewed. This 
assumption was used because households tend to be more similar within blocks than across 
blocks, although noninterview households still probably differ somewhat from interviewed 
households, especially with respect to household size (see, e.g., Palmer 1967). 
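
The within-block redistribution of weights can be sketched as follows. The record layout, field names, and example weights are hypothetical, and the sketch assumes positive weights and at least one interviewed household in every block that contains a noninterview.

```python
from collections import defaultdict


def adjust_for_noninterviews(households):
    """Redistribute noninterview weights to interviewed households by block.

    households : list of dicts with keys "block", "weight", "interviewed".
    Returns copies of the interviewed households with an added "adj_weight".
    """
    total = defaultdict(float)        # weight of all occupied units, per block
    responding = defaultdict(float)   # weight of interviewed units, per block
    for h in households:
        total[h["block"]] += h["weight"]
        if h["interviewed"]:
            responding[h["block"]] += h["weight"]

    adjusted = []
    for h in households:
        if h["interviewed"]:
            # Inflate by the ratio of all weight to responding weight in the block.
            factor = total[h["block"]] / responding[h["block"]]
            adjusted.append({**h, "adj_weight": h["weight"] * factor})
    return adjusted


households = [
    {"block": "A", "weight": 10.0, "interviewed": True},
    {"block": "A", "weight": 10.0, "interviewed": True},
    {"block": "A", "weight": 10.0, "interviewed": False},
    {"block": "B", "weight": 20.0, "interviewed": True},
]
adjusted = adjust_for_noninterviews(households)
```

In block A the noninterview weight of 10 is shared between the two interviewed households, raising each to 15, while block B is unchanged; the total weight across the site is preserved.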

It is possible that the data obtained for a household by proxy interview (which in TARO 
referred to a completed interview with someone outside the household) are of sufficiently low 
quality that such a household should be classified as a noninterview household. The quality 
of data from the 189 proxy interviews in TARO is discussed in Section 4, and some coverage 
error estimates with proxy interviews treated as noninterviews are presented in Section 7. 


4. MISSING CHARACTERISTICS IN THE P- AND E-SAMPLES 


Even when an interview was obtained for a P-sample household, the data on person and 
housing characteristics were sometimes incomplete. Incomplete data on characteristics also 
occurred in the census and therefore in the E-sample. 

The variables used in poststratification for TARO (Diffendal 1988) included the housing 
variable Tenure (1 = owned, 2 = rented or occupied without payment) and the person variables 
Sex (1 = male, 2 = female), Age (1 = 0-14, 2 = 15-29, 3 = 30-44, 4 = 45-64, 5 = 65+), and 
Race (1 = Hispanic, 2 = Asian non-Hispanic, 3 = Other). In addition, the housing variable 
Structure (1 = single-unit, 2 = multiunit) was used in handling missing P-sample match statuses 
and missing E-sample enumeration statuses (see Sections 5 and 6). 

Table 1 displays the missing-characteristic counts for the entire P- and E-samples and for 
cases coming from P-sample proxy interviews. For the P- and E-samples, the highest missing- 
data rate was 7.0% for E-sample Race, with all other rates being 3.5% or lower. The missing- 
data rates for P-sample proxy cases were all several times higher than those for the entire 
P-sample, although only Tenure (20.2%) had a rate higher than 10%. 

Table 1 

Missing-Characteristic Counts (% in Parentheses) 
for the Entire P- and E-Samples and for P-Sample Proxy Interviews 

Variable      P-Sample            E-Sample            P-Sample Proxy
              (19,552 persons)    (20,976 persons)    (430 persons)

Tenure        690 (3.5)            154 (0.7)          87 (20.2)
Structure     459 (2.3)            343 (1.6)          38  (8.8)
Sex           418 (2.1)             82 (0.4)          18  (4.2)
Age           187 (1.0)            432 (2.1)          18  (4.2)
Race          155 (0.8)           1463 (7.0)          17  (4.0)

NOTE: The 19,552 persons in the P-sample include the 430 proxy cases. 


Missing characteristics for each of the samples (P and E) were imputed by a hot-deck method 
involving two passes through the data after the data had been sorted geographically. On the 
first pass, missing values of Tenure, Structure, and Race were imputed using the most 
recent observed data, because of the presumed strong relation between these variables and 
geography. In addition, distributions of Sex and Age were tabulated for categories of type of 
household (single-person vs. multiperson), marital status, relationship to head of household, 
and sex and age of head of household, using all observed data. On the second pass, missing 
values of Sex and Age were imputed at random from the distributions tabulated during the 
first pass. Further details on the imputation of characteristics in TARO can be found in 
Schenker (1987). 

In summary, the block sample design of the PES was helpful not only in developing a 
noninterview weighting scheme (Section 3), but also in the imputation of characteristics that 
tend to be clustered by block, that is, Tenure, Structure, and Race. 
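
The two-pass hot deck can be sketched in a few lines. This is an illustration under stated assumptions, not the TARO production procedure: records are dictionaries already sorted geographically, missing values are coded as None, the field names are hypothetical, and a single hypothetical household-type variable stands in for the full set of tabulation categories.

```python
import random
from collections import defaultdict


def first_pass(records, fields=("tenure", "structure", "race")):
    """Sequential hot deck: fill a missing value with the most recently
    observed value in the geographically sorted file."""
    last_seen = {}
    for rec in records:
        for f in fields:
            if rec[f] is None:
                if f in last_seen:          # no donor yet: leave missing
                    rec[f] = last_seen[f]
            else:
                last_seen[f] = rec[f]       # only observed values become donors
    return records


def second_pass(records, field, class_key, rng):
    """Impute `field` at random from the observed distribution within the
    imputation class returned by class_key(rec)."""
    donors = defaultdict(list)
    for rec in records:
        if rec[field] is not None:
            donors[class_key(rec)].append(rec[field])
    for rec in records:
        if rec[field] is None and donors[class_key(rec)]:
            rec[field] = rng.choice(donors[class_key(rec)])
    return records


records = [
    {"tenure": 2, "structure": 2, "race": 1, "sex": 1, "hh_type": "multi"},
    {"tenure": None, "structure": 2, "race": None, "sex": None, "hh_type": "multi"},
    {"tenure": 1, "structure": None, "race": 3, "sex": 2, "hh_type": "single"},
]
first_pass(records)
second_pass(records, "sex", lambda r: r["hh_type"], random.Random(0))
```

Drawing from the list of observed values within a class is equivalent to sampling from the tabulated distribution for that class; a seeded generator is used here only to make the sketch reproducible.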


5. MISSING MATCH STATUSES IN THE P-SAMPLE 


Of the 19,552 P-sample cases resulting from completed interviews, 161 (0.8%) were missing 
match statuses for dual-system estimation. All but three of these unresolved cases fell into two 
broad categories: 105 cases for which matching was not attempted due to incomplete names 
and/or insufficient characteristics; and 53 movers between Census Day and the PES for whom 
there were problems specifying a Census Day address or finding the census questionnaire for 
the Census Day address. 

A traditional approach to handling a missing binary item such as match status is to impute 
one of the two possible outcomes for the missing item. For example, in the estimation of under- 
count for the 1980 Decennial Census, the match status for each unresolved P-sample case was 
imputed from a resolved case with similar characteristics (Fay, Passel, and Robinson 1988, 
Chapter 6). A different approach was taken in TARO, however. After all missing characteristics 
were imputed using the methods described in Section 4, a match probability was imputed for 
each unknown match status; the probability was estimated using an explicit model (to be 
described later in this section). The contribution of the unresolved cases to the M term of the 
dual-system estimate (1) was the weighted sum of the imputed probabilities. 

Probabilities rather than binary outcomes were imputed for two reasons. First, imputing 
random binary outcomes is less efficient than imputing estimated probabilities, yielding 
estimates with higher variances (see Rubin 1987, p. 15). Second, because imputed probabilities 
represent uncertainty about the missing match statuses, it should be possible to use the 
probabilities to obtain a variance due to imputation. Note, however, that since the dual-system 
estimator (1) is nonlinear in M, imputing a probability (or mean) for each missing binary 
outcome introduces some bias into the estimation (see Rubin 1987, p. 14). Current research 
is investigating the use of imputed probabilities for missing binary data. 
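
The efficiency argument can be made concrete. In the sketch below the weights and probabilities are hypothetical, and the binary alternative is assumed to draw each missing status independently as a Bernoulli outcome with the imputed probability. Both approaches then have the same expected contribution to M, but the random draws add variance.

```python
def probability_contribution(weights, probs):
    """Contribution to M when each unresolved case receives its probability."""
    return sum(w * p for w, p in zip(weights, probs))


def extra_variance_binary(weights, probs):
    """Variance added if each case is instead imputed as an independent
    Bernoulli(p) draw: the sum of w^2 * p * (1 - p)."""
    return sum(w * w * p * (1 - p) for w, p in zip(weights, probs))


# Hypothetical sampling weights and imputed match probabilities.
weights = [20.0, 20.0, 35.0]
probs = [0.9, 0.6, 0.75]

m_contribution = probability_contribution(weights, probs)
added_variance = extra_variance_binary(weights, probs)
```

For these three cases the contribution to M is 56.25 under either approach in expectation, while random binary imputation would add a variance of about 361.7 to that contribution.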

The following logistic regression approach was used to impute match probabilities. Let X 
denote a vector of predictors, Y = match or nonmatch, and p = Pr(Y = match | X). The 
parameter vector β of the logistic regression model 


logit(p) = log[p/(1 - p)] = X′β 


was estimated from the data for the resolved cases using the Bayesian techniques for categorical 
logistic regressions described in Rubin and Schenker (1987); these techniques involve adding 
fractional observations to each cell in the logistic regression and then fitting the model by 
standard maximum-likelihood methods. Then for unresolved case j, with X = x_j, the imputed 
match probability was 


p̂_j = logit⁻¹(x_j′β̂) = exp(x_j′β̂)/[1 + exp(x_j′β̂)], 


where β̂ denotes the estimate of β. The background variables used to define X were Tenure, 
Structure, Sex, Age, and Race, as well as variables indicating regular interview versus proxy 
interview and mover versus nonmover between Census Day and the PES. 
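
The imputation step itself is a direct application of the inverse logit. The sketch below assumes the coefficients have already been estimated; the coefficient values, the reduced predictor set (intercept, proxy indicator, mover indicator), and the weights are hypothetical and are not those of Table A1. The fitting step with fractional observations is not reproduced.

```python
import math


def inv_logit(eta):
    """Inverse logit: exp(eta) / (1 + exp(eta))."""
    return 1.0 / (1.0 + math.exp(-eta))


def impute_match_probability(x, beta_hat):
    """Imputed match probability for one unresolved case.

    x, beta_hat : equal-length sequences of predictors and estimated
    coefficients, with the intercept carried as a leading 1 in x.
    """
    eta = sum(xi * bi for xi, bi in zip(x, beta_hat))
    return inv_logit(eta)


# Hypothetical estimates: intercept, proxy indicator, mover indicator.
beta_hat = [2.0, -0.7, -1.3]
cases = [
    {"x": [1, 0, 0], "weight": 20.0},   # regular interview, nonmover
    {"x": [1, 1, 0], "weight": 20.0},   # proxy interview
    {"x": [1, 0, 1], "weight": 35.0},   # mover
]
probs = [impute_match_probability(c["x"], beta_hat) for c in cases]

# Contribution of the unresolved cases to the M term: weighted sum of probabilities.
m_contribution = sum(c["weight"] * p for c, p in zip(cases, probs))
```

Negative coefficients on the proxy and mover indicators translate into lower imputed match probabilities for those cases, which is the qualitative pattern described for the fitted model.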

Table Al (in the Appendix) gives the logistic regression coefficient estimates. The large coef- 
ficients associated with interview and mover status indicate that proxy and mover cases have 
much lower imputed match probabilities than others. It may be that these lower match pro- 
babilities are due in part to difficulties in matching proxy and mover cases rather than just 
lower census capture rates for these cases. If this is true, alternative treatments of the data may 
be in order; such alternatives are considered in Section 7. 

Of the 19,391 resolved P-sample cases, 17,018 (87.8%) were matches. The (unweighted) sum 
of the 161 imputed match probabilities was 124.66; thus the imputed match rate was 77.4%. 
Although a stratified sample of blocks was used in TARO, the estimation of the logistic regres- 
sion parameters assumed a simple random sample of people. To examine the possible biases 
due to not accounting for the stratification, the logistic regression was fitted again (after TARO 
was completed) with indicator variables for the six sampling strata (Diffendal 1988) included 
in X. The result of this refinement is a sum of imputed match probabilities equal to 124.50 
(77.3%). The minor effect of this change on estimates of census coverage error is demonstrated 
in Section 7. Implications of possible design effects due to clustering are discussed in Section 8. 


6. MISSING ENUMERATION STATUSES IN THE E-SAMPLE 


Of the 20,976 cases in the E-sample, 3,714 were followed up or should have been followed 
up. After followup, 979 cases (4.7% of total, 26.4% of followup) had missing enumeration 
statuses. All but nine of these unresolved cases fell into four broad categories: 498 cases that 
should have been followed up but were not; 257 cases in which the respondent to the followup 
interview did not know the person in question; 137 cases for which the interview yielded insuf- 
ficient information to determine an enumeration status; and 78 cases for which there were 
followup noninterviews. 

Missing enumeration statuses in the E-sample were handled by imputing a probability of 
erroneous enumeration for each unresolved case. The contribution of the unresolved cases to 
the EE term of the dual-system estimate (1) was the weighted sum of the imputed probabilities. 
The imputation procedure was analogous to that used for P-sample match statuses with one 
major change: Since missing enumeration statuses resulted solely from followup, only the 
resolved cases from followup were used in estimating the logistic regression. The background 
variables used to define X for the logistic regression were Tenure, Structure, Sex, Age, and 
Race, along with variables indicating whether the census questionnaire for the person’s 
household was returned by mail and whether the entire household or only part of the household 
was not matched before followup. Table A2 (in the Appendix) gives the logistic regression 
coefficient estimates. 

Of the 17,262 non-followup cases, 278 (1.6%) were classified as erroneous enumerations 
or unmatchable. There were 2,735 resolved followup cases, of which 82 (3.0%) were classified 
as erroneous enumerations. The (unweighted) sum of the 979 imputed probabilities was 21.93 
(2.2%). When indicator variables for the sampling strata are included in X, the sum changes 
to 23.58 (2.4%). As with the P-sample, this change has a very minor effect on estimates of 
coverage error; see Section 7. 


7. ESTIMATES OF COVERAGE ERROR UNDER ALTERNATIVE TREATMENTS 
OF MISSING DATA AND OTHER PROBLEM CASES 


This section examines the effects of alternative treatments of missing data and other problem 
cases on estimates of coverage error for the three categories of race defined by the variable 
Race (Hispanic, Asian non-Hispanic, and Other). For a given treatment and race category, 
let N be the sum of the dual-system estimates over all poststrata corresponding to the race 
category and let N_c be the sum of the unadjusted census counts over the poststrata. The 
estimated undercount rate is then 100(1 - N_c/N)%. 

Consider first the alternative of including indicators of the sampling strata as predictors 
in the P- and E-sample logistic regressions for imputing match and erroneous enumeration 
probabilities, as discussed in Sections 5 and 6. The estimated undercount rates from TARO, 
which were obtained without using these predictors, are 9.85% for Hispanics, 7.32% for Asian 
non-Hispanics, and 6.24% for Others. When indicators of the sampling strata are used, the 
estimates change to 9.82% for Hispanics, 7.31% for Asian non-Hispanics, and 6.21% for 
Others. The largest difference due to including the sampling stratum indicators is only 0.03%. 
For all the alternative treatments to be considered, however, this refinement is used because 
it is in principle more correct; for instance, it should yield more accurate standard errors. 


7.1 Treatments that Lower the Estimated Undercount 


The match rate for the 375 resolved P-sample proxy cases was 78.9% as opposed to the 
overall P-sample rate of 87.8%. While it may be true that proxy cases were actually captured 
in the census less frequently than others, it is possible that part of the difference in the match 
rates is due to missing and/or incorrect proxy data (see Section 4). A conservative treatment 
would be to classify the 189 proxy interviews as household noninterviews and apply the 
weighting adjustment described in Section 3; this would essentially assign proxy cases the same 
match rate as nonproxy cases. (Note that when all proxy interviews are classified as 
noninterviews, an indicator of proxy/nonproxy status is no longer included in the logistic regression 
model for imputing match probabilities.) 

The match rate for the 277 resolved P-sample movers (between Census Day and the PES) 
was 66.1%. It is generally believed that movers are captured in the census at a lower rate than 
nonmovers, but it may be that the low match rate for movers is partly due to difficulties inherent 
in matching movers, such as problems in obtaining a correct Census Day address. A conservative 
treatment would be to classify all cases for movers as unresolved and then impute match 
probabilities for unresolved cases using a logistic regression model that does not include 
mover/nonmover status as a predictor. This would essentially assign movers the same match 
rate as nonmovers. 


Table 2 

Estimated Undercount Rates (in %) by Race Under Alternative Treatments 
of P-sample Proxy Interviews, P-sample Movers, and E-sample W1’s 

         Treatment
(1 = alternative, 0 = TARO)     Hispanic       Asian            Other
Proxy    Mover    W1                           non-Hispanic

  0        0       0            9.82           7.31             6.21
  0        0       1            9.30           6.76             5.83
  0        1       0            9.33           7.24             6.19
  0        1       1            8.80           6.69             5.81
  1        0       0            [illegible]    6.52             6.24
  1        0       1            9.03           5.96             5.86
  1        1       0            9.04           6.45             [illegible]
  1        1       1            8.51           5.90             5.84

NOTE: Indicators of the sampling strata were used as predictors in the logistic regressions for imputing match and 
erroneous enumeration probabilities. 

Of the 979 unresolved E-sample cases, 257 had the followup interview code W1, meaning 
that the respondent did not know the person in question. A code of W1 could have indicated 
that the person in question was fictitious. Therefore, after TARO, all W1’s were reviewed by 
experienced matching personnel. Any case that showed evidence (such as a note from the inter- 
viewer) of possibly being fictitious was marked; there were 118 such cases. An alternative treat- 
ment to that used in TARO would be to classify the 118 cases as resolved erroneous enumer- 
ations before imputation. This would raise both the observed and imputed rates of erroneous 
enumeration. 

Table 2 displays the undercount estimates by race category for the 2x2x2 factorial design 
with the factors being whether or not alternative treatments are used for proxy interviews, 
movers, and W1’s. The ranges between the lowest and highest estimated undercount rates are 
1.31% for Hispanics, 1.41% for Asian non-Hispanics, and 0.43% for Others. 

Note that for each race category, there is not much interaction between the treatments of 
proxy interviews, movers, and W1’s. In fact, the following simple additive model can be used 
to predict the entries in Table 2 for each race category: 


$$\hat{Y} = \hat{a}_0 + I_p \hat{a}_p + I_m \hat{a}_m + I_w \hat{a}_w, \qquad (2)$$ 


where $\hat{Y}$ is the predicted estimate of the undercount rate; $I_p$, $I_m$, and $I_w$ are the treatment 
indicators (1 = alternative, 0 = TARO) for proxy interviews, movers, and W1’s, respectively; 
and $\hat{a}_0$, $\hat{a}_p$, $\hat{a}_m$, and $\hat{a}_w$ are parameter estimates given in Table 3. The parameter $\hat{a}_0$ is the 
estimated undercount rate when no alternative treatments are used; $\hat{a}_p$, $\hat{a}_m$, and $\hat{a}_w$ are the 
effects of using alternative treatments for proxy interviews, movers, and W1’s, respectively. 
The largest residual when equation (2) is used to predict the entries in Table 2 is 0.02%. 
Table 3 


Parameter Estimates for the Additive Model (2) for Predicting 
the Estimated Undercount Rates in Table 2 


                              Asian 
Parameter      Hispanic    non-Hispanic     Other 

$\hat{a}_0$       9.82        7.31          6.21 
$\hat{a}_p$      -0.28       -0.7925        0.03 
$\hat{a}_m$      -0.505      -0.0675       -0.02 
$\hat{a}_w$      -0.525      -0.5525       -0.38 
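
The additive fit can be checked numerically. The sketch below is one plausible reading of how the Table 3 parameters arise, not a statement of the procedure actually used: it assumes the intercept is the all-TARO cell of Table 2 and that each treatment effect is the mean of the four differences (alternative minus TARO) over the settings of the other two treatments. Three Table 2 entries that are hard to read in the printed table (9.55, 7.31, and 6.22) are taken here as assumptions consistent with the text and with Table 3. The names `TABLE2`, `fit_additive`, and `predict` are hypothetical.

```python
from itertools import product

# Table 2 entries keyed by (proxy, mover, w1); 1 = alternative, 0 = TARO.
# The values 9.55, 7.31 and 6.22 are assumed readings of illegible cells.
TABLE2 = {
    "Hispanic": {
        (0, 0, 0): 9.82, (0, 0, 1): 9.30, (0, 1, 0): 9.33, (0, 1, 1): 8.80,
        (1, 0, 0): 9.55, (1, 0, 1): 9.03, (1, 1, 0): 9.04, (1, 1, 1): 8.51,
    },
    "Asian non-Hispanic": {
        (0, 0, 0): 7.31, (0, 0, 1): 6.76, (0, 1, 0): 7.24, (0, 1, 1): 6.69,
        (1, 0, 0): 6.52, (1, 0, 1): 5.96, (1, 1, 0): 6.45, (1, 1, 1): 5.90,
    },
    "Other": {
        (0, 0, 0): 6.21, (0, 0, 1): 5.83, (0, 1, 0): 6.19, (0, 1, 1): 5.81,
        (1, 0, 0): 6.24, (1, 0, 1): 5.86, (1, 1, 0): 6.22, (1, 1, 1): 5.84,
    },
}


def fit_additive(cells):
    """Return (a0, [a_p, a_m, a_w]) for one race category.

    Assumed rule: a0 is the all-TARO cell, and each effect is the mean
    difference between the alternative and TARO levels of one treatment,
    averaged over the four settings of the other two treatments.
    """
    a0 = cells[(0, 0, 0)]
    effects = []
    for k in range(3):
        diffs = []
        for combo in product((0, 1), repeat=3):
            if combo[k] == 1:
                baseline = combo[:k] + (0,) + combo[k + 1:]
                diffs.append(cells[combo] - cells[baseline])
        effects.append(sum(diffs) / len(diffs))
    return a0, effects


def predict(a0, effects, combo):
    """Evaluate the additive model (2) for one treatment combination."""
    return a0 + sum(i * a for i, a in zip(combo, effects))


fits = {race: fit_additive(cells) for race, cells in TABLE2.items()}
max_residual = max(
    abs(cells[combo] - predict(*fits[race], combo))
    for race, cells in TABLE2.items()
    for combo in cells
)

for race, (a0, effects) in fits.items():
    print(race, a0, [round(a, 4) for a in effects])
print("largest absolute residual:", round(max_residual, 3))
```

Under these assumptions the fitted effects agree with the entries of Table 3, and the largest residual stays within the 0.02% quoted in the text.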


7.2 A Procedure that Raises the Estimated Undercount 


Because TARO was confined to one small area in the United States, no PES data could be 
obtained for people who moved out of the test site between Census Day and the PES. The omis- 
sion of these outmovers from estimation was equivalent to assuming that they had the same 
capture rate in the census as the included cases. This was a conservative assumption, since 
movers are generally believed to have a lower capture rate than nonmovers. 

There were 409 people who moved into the test site between Census Day and the PES. 
These inmovers were not included in the estimation because their Census Day addresses 
were outside the test site and thus their data applies to other areas. Moreover, there were no 
census cases to which to match the inmovers since they were outside the test site on Census 
Day. 

A procedure that might indicate the effect of including outmovers in the estimation would 
be to include the 409 inmovers as substitutes and impute match probabilities for them (since 
their match statuses are unknown). The treatments yielding the highest and lowest estimates 
in Table 2 have been applied to the TARO data with inmovers included; the results are displayed 
in Table 4. Note that the lower estimated undercount rates in Table 4 (obtained using the alter- 
natives to the TARO treatments for proxy interviews, movers, and W1’s) are all within 0.04% 
of the corresponding estimates in Table 2. This result is expected, since the addition of cases 
having an imputed match rate that is approximately the same as the overall match rate should 
not affect the estimates much. The higher estimates in Table 4 are larger than the correspon- 
ding estimates in Table 2 by 0.34% for Hispanics, 0.50% for Asian non-Hispanics, and 0.38% 
for Others. 


Table 4 


Estimated Undercount Rates (in %) by Race When Inmovers are 
Included in the Data with Imputed Match Probabilities 


         Treatment 
(1 = alternative, 0 = TARO)                    Asian 
Proxy    Mover    W1          Hispanic     non-Hispanic     Other 

  0        0       0           10.16           7.81          6.59 
  1        1       1            8.50           5.86          5.81 


NOTE: Indicators of the sampling strata were used as predictors in the logistic regressions for imputing match and 
erroneous enumeration probabilities. 
8. SUMMARY AND DISCUSSION 


A combination of weighting and (random and nonrandom) imputation methods was used 
to handle missing data in TARO. P-sample household noninterviews were handled by a block- 
level weighting adjustment. A hot-deck imputation method was used for missing characteristics 
in both samples. Missing P-sample match statuses and E-sample enumeration statuses were 
handled using imputed probabilities estimated by logistic regression methods. 

As mentioned in Sections 5 and 6, the use of imputed probabilities for missing P-sample 
match statuses and E-sample enumeration statuses should facilitate the assessment of 
variability due to imputing these statuses. To assess this variability completely, it is necessary 
to measure variability due to estimating the logistic regression parameters as well as the 
variability due to imputation given $\beta$ (Rubin and Schenker 1986). Thus an estimated variance- 
covariance matrix for $\hat{\beta}$ is needed. Since a cluster sample was used in TARO, the logistic 
regression estimation procedures (Section 5), which assume a simple random sample, do not 
provide an accurate estimate of the variance-covariance matrix. This was not a major 
concern in TARO, because the measurement of imputation variance was not a primary goal. 
Moreover, for the nonresponse rates achieved in TARO, the variability due to uncertainty 
in estimating $\beta$ is likely to be minor relative to the uncertainty due to imputation given $\beta$ 
(Rubin and Schenker 1986). 

Although it is possible in principle to assess the variability due to imputing match and 
enumeration statuses using the TARO procedures, variability due to imputing missing 
characteristics (Section 4) cannot be quantified. One way to make the quantification of such 
variability possible would be to multiply impute characteristics in the P- and E-samples (Rubin 
1987). Several dual-system estimates would then need to be calculated, however — one for 
each set of imputations. 

The models underlying the weighting and imputation methods used in TARO assume that 
given the observed data, the chance of a variable being missing does not depend on its value. 
Another issue regarding imputation is how best to impute characteristics and match statuses 
(or enumeration statuses) simultaneously. The TARO procedure of first imputing 
characteristics and then imputing statuses conditional on the imputed characteristics assumes 
that statuses are not useful predictors for imputing characteristics. Models that relax the TARO 
assumptions may be more appropriate. Rubin, Schafer, and Schenker (1988) discuss this 
further. 

Missing data are only one source of error in estimating coverage. Other sources, such as 
matching error and violations of the assumption of constant capture probabilities (Section 
2), are discussed in Hogan and Wolter (1988). After assessing all of these sources of error 
for TARO, Hogan and Wolter conclude that the TARO coverage measurement is more 
accurate than the original enumeration. 
APPENDIX 


LOGISTIC REGRESSION RESULTS 


Table A1 

Results for P-Sample Logistic Regression 

[Only the Codes column survives; predictor names and estimated coefficients are not recoverable.] 

Codes 

1 if regular, -1 if proxy 
1 if nonmover, -1 if mover 
1 if owner, -1 otherwise 
1 if single-unit, -1 if multiunit 
1 if male, -1 if female 
1 if 0-14, -1 if 65+, 0 otherwise 
1 if 15-29, -1 if 65+, 0 otherwise 
1 if 30-44, -1 if 65+, 0 otherwise 
1 if 45-59, -1 if 65+, 0 otherwise 
1 if Hispanic, -1 if Other, 0 if Asian non-Hispanic 
1 if Asian non-Hispanic, -1 if Other, 0 if Hispanic 


Table A2 

Results for E-Sample Logistic Regression 

[Only the Codes column survives; predictor names and estimated coefficients are not recoverable.] 

Codes 

1 if mail-return, -1 otherwise 
1 if partial-household match, -1 if whole-household nonmatch 
1 if owner, -1 otherwise 
1 if single-unit, -1 if multiunit 
1 if male, -1 if female 
1 if 0-14, -1 if 65+, 0 otherwise 
1 if 15-29, -1 if 65+, 0 otherwise 
1 if 30-44, -1 if 65+, 0 otherwise 
1 if 45-59, -1 if 65+, 0 otherwise 
1 if Hispanic, -1 if Other, 0 if Asian non-Hispanic 
1 if Asian non-Hispanic, -1 if Other, 0 if Hispanic 
Measuring Accuracy in a Post- 
Enumeration Survey 


HOWARD HOGAN and KIRK WOLTER¹ 


ABSTRACT 


The U.S. Bureau of the Census will use a post-enumeration survey to measure the coverage of the 1990 
Decennial Census. The Census Bureau has developed and tested new procedures aimed at increasing 
the accuracy of the survey. This paper describes the new methods. It discusses the categories of error 
that occur in a post-enumeration survey and means of evaluation to determine that the results are 
accurate. The new methods and the evaluation of the methods are discussed in the context of a recent 
test post-enumeration survey. 


KEY WORDS: Census; Undercount; Overcount; Coverage Evaluation. 


1. INTRODUCTION 


In this article we discuss recent research at the U.S. Bureau of the Census to improve the 
accuracy of a post-enumeration survey and to measure that accuracy. Much of this research 
was originally directed toward the goal of developing a sound body of statistical theory, 
methods, and operations for correcting U.S. census figures for coverage errors. The results 
presented in this paper show that we are now able to produce PES estimates of total popula- 
tion that are closer to the true population than are original census estimates. 

In light of a policy decision made by the U.S. Department of Commerce not to correct 
the 1990 enumeration for coverage error, the PES methods we discuss will be used to provide 
a careful evaluation of the coverage of the 1990 Census. See U.S. Department of Commerce 
(1987). This evaluation will be used to inform users of the limitations of the census, to inform 
planning for future censuses, or to improve the Census Bureau’s estimates of the U.S. popula- 
tion for years subsequent to the census year. 

The PES method uses two samples to measure net coverage error. A sample of people who 
should have been counted in the original census enumeration is interviewed after the census 
and is used to measure census omissions. We call this the population or “P” sample. One 
also needs a sample of census enumerations to measure duplicates and other errors included 
in the census count. We call this the enumeration or “E” sample. The two samples are combined 
to estimate total population using the dual-system estimator (DSE). See Diffendal (1988) for a full 
discussion of the samples and the dual-system model. Unless otherwise stated, we will use 
Diffendal’s notation throughout this article. 
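
As a rough illustration of how the two samples combine, the sketch below uses the common textbook form of the dual-system estimator, in which the census count is scaled by the E-sample correct-enumeration rate and divided by the P-sample match rate. This form is an assumption: the article itself defers to Diffendal (1988) for the exact estimator and notation, which may include further adjustments. The function names and all input figures are hypothetical.

```python
def dual_system_estimate(census_count, e_total, e_correct, p_total, p_matched):
    """Simplified dual-system estimate of total population (assumed form).

    census_count : original census count for the area
    e_total      : E-sample enumerations examined
    e_correct    : E-sample enumerations judged correct
    p_total      : P-sample people
    p_matched    : P-sample people matched to a census enumeration
    """
    correct_rate = e_correct / e_total   # share of census records that are valid
    match_rate = p_matched / p_total     # estimated census capture probability
    return census_count * correct_rate / match_rate


def undercount_rate(census_count, estimate):
    """Net undercount as a percentage of the estimated population."""
    return 100.0 * (1.0 - census_count / estimate)


# Hypothetical inputs chosen only to show the mechanics.
dse = dual_system_estimate(
    census_count=10_000, e_total=1_000, e_correct=970, p_total=1_000, p_matched=880
)
rate = undercount_rate(10_000, dse)
print(f"DSE = {dse:.0f}, net undercount = {rate:.2f}%")
```

In this form a lower P-sample match rate raises the estimated undercount, while a lower correct-enumeration rate reduces it, which is why errors in either sample matter for the final figure.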

The Census Bureau conducted a PES in conjunction with the 1980 Census. The P sample 
consisted of persons in households enumerated in the April and August Current Population 
Survey (CPS) samples. For a description of the CPS, see U.S. Bureau of the Census (1978). 
The E sample was a separate and independent sample of persons in housing units enumerated 
in the census. In addition, the Census Bureau produced an alternative set of undercount 
estimates based upon an aggregate analysis of birth and death registration data, administrative 
records, and previous censuses. This program, called demographic analysis, will be referred 
to occasionally in this article. The Census Bureau did not correct the 1980 enumeration for 
undercount errors because we considered the PES estimates to be flawed by missing and inac- 
curate data. In addition, the demographic analysis results were flawed by, among other things, 
a lack of data on the number of undocumented immigrants and the lack of an acceptable method 
to carry the estimates down to the state and local level. See Fay et al. (1988). 

In very recent years, we have developed a new PES design and new methodology that 
minimizes the problems experienced in 1980, while not creating major new ones. The new PES 
design is based on a common area sample of census blocks for both the P and E samples. The 
P sample consists of all people living in the sample blocks at the time of PES interviewing. 
Interviewers visit each housing unit and determine where the residents were living at the time 
of the census. 

Using newly developed computer matching methods and software (Jaro 1988), we attempt 
to match all P-sample people to corresponding census enumerations. Clerks review the com- 
puter’s work and make a final determination as to the enumeration status (either enumerated 
or missed in the original enumeration) of each P-sample person. For people who moved between 
the census and the PES, we assign the census-day address to the proper block and search for 
a match there. For a few cases, matching is indeterminate at this point, and a further inter- 
view or followup is necessary either to gather additional information or to resolve conflicts 
in existing information. After the followup, clerks assign an enumeration status to the P-sample 
people for whom the followup interview is complete. For a very few residual cases, matching 
may be still unresolved, and we impute to each an enumeration status, using appropriate 
statistical techniques for missing data (Schenker 1988). 

For each E-sample person, a determination is made as to the person’s enumeration status 
(either correctly enumerated or erroneously enumerated) in the original census. Section 6 gives 
a description of what constitutes an erroneous enumeration (EE), and all non-erroneous enumerations 
are considered correct enumerations (CE). In many cases, the census enumerates the same 
people that are interviewed as part of the P sample. Thus, the two samples overlap to a great 
extent. Most E-sample people who are also in the P sample (as determined by the computer and 
clerical matching system) are automatically declared CE. However, the overlap is not complete. 
The P sample will miss some people that are included in the E sample and vice versa. The census 
will enumerate others in the block by mistake. Interviewers will invent some enumerations. For 
all E-sample people who are not matched to a P-sample person, it is necessary to conduct a 
followup interview. This followup gathers enough information to allow a determination of 
whether the E-sample people were counted correctly in the original enumeration. 

We tested the new PES design in 1986 in connection with a test census in Los Angeles. The 
test was called the Test of Adjustment Related Operations (TARO) and consisted of 190 blocks, 
containing almost six thousand housing units and 20,000 people. The estimated net undercount 
for the Los Angeles test was about 9 percent. For details on TARO methods and results, see 
Diffendal (1988) and Schenker (1988). 

We also tested the new PES design in a rural area of Mississippi during 1986. There we used 
a sample of 271 blocks with about 3250 housing units and eight thousand people. The estimated 
undercount in this test was 5.5 percent. For details of results and methodology, see Anolik 
(1988). Although the Mississippi test data have not been as completely analyzed as the TARO 
data, we will refer occasionally to the results in this article. 

An important question is whether the new PES can produce more accurate estimates of 
population than can the original census enumeration. In theory, the PES estimates should be 
considered the more accurate, but in practice, nonsampling errors can and do arise in the 
conduct and analysis of both the PES and the census enumeration. Careful study is needed 
to assess their relative accuracies. In this article, we present our assessment of the error 
structure of the 1986 TARO. 


Table 1 


TARO Errors and Estimates of the Mean Effect on the 
Estimated Undercount of Correcting the Error 


                                                     Mean Effect on 
Sources of Error                                     Estimated Undercount 

Matching error                                             -1.0% 
Reporting census-day address                               -1.0% 
Fabrication in the PES interview                           -1.0% 
Missing data                                                0.0% 
Error in measuring the erroneous enumerations              -0.5% 
Balancing gross overcounts and undercounts                  0.0% 
Correlation bias                                           +2.3% 
Random error                                                0.0% 

Eight potential sources of error affect coverage measurements produced by the PES: 
sampling error plus seven sources of nonsampling error. The sources and our summary assess- 
ment of their impact on TARO data are presented in Table 1. The second column gives the 
effects of the errors on the estimated undercount. For example, if we correct all “matching 
errors,” the estimated undercount would be reduced by about one percentage point, from 9 
percent to 8 percent. Some errors, such as “missing data” and “random error”, might either 
raise or lower the undercount, and our best assessment is that these errors introduce no impor- 
tant bias into TARO data. The figures in this column represent assessments of individual error, 
without regard for the other sources of error. 

By construction, the eight individual errors tend to be mutually exclusive and additive. Some 
overlaps or interactions are possible between the different sources, but we believe they are small 
and we ignore them here. Overall, we calculate the joint effect of the errors as 


(-1.0 - 1.0 - 1.0 + 0.0 - 0.5 + 0.0 + 2.3 + 0.0) percent = -1.2 percent. 


Thus, correcting for the joint effect of the errors would lower the estimated undercount from 
9.0 percent to about 7.8 percent. The corrected figure, 7.8 percent, may be viewed approx- 
imately as the mean of a posterior error distribution for the TARO undercount. Development 
of a complete posterior error distribution is proceeding at the Census Bureau (see Mulry and 
Spencer 1988). 
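
The arithmetic behind the joint effect can be reproduced directly from the Table 1 entries. The sketch below simply sums the eight mean effects and applies the total to the original 9.0 percent estimate; it relies on the additivity assumption stated above, and the variable names are hypothetical.

```python
# Mean effect (in percentage points) of correcting each error source, from Table 1.
mean_effects = {
    "Matching error": -1.0,
    "Reporting census-day address": -1.0,
    "Fabrication in the PES interview": -1.0,
    "Missing data": 0.0,
    "Error in measuring the erroneous enumerations": -0.5,
    "Balancing gross overcounts and undercounts": 0.0,
    "Correlation bias": 2.3,
    "Random error": 0.0,
}

original_undercount = 9.0  # original TARO estimate, in percent

# Joint effect under the assumption that the sources are additive.
joint_effect = sum(mean_effects.values())
corrected_undercount = original_undercount + joint_effect

print(f"joint effect: {joint_effect:+.1f} percentage points")
print(f"corrected undercount: {corrected_undercount:.1f} percent")
```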

Because the original TARO estimate of 9 percent is much closer to the corrected figure of 
7.8 percent than the corrected figure is to zero, we conclude that the original TARO data is 
closer to the truth than is the original census enumeration. 

In the next 8 sections of the article, we treat the error components one by one. Each section 
discusses both the procedures and problems confronted in the 1980 PES, and the error-resistant 
improvements that were tested in TARO. We describe the evaluation of each error compo- 
nent and the evidence for our conclusions. The paper closes in Section 10 with a summary of 
our findings and some directions for future research. 
2. MATCHING ERROR 


Errors in classifying P-sample people as enumerated or not can occur for two general reasons: 


(a) the information reported by the respondent/interviewer is incorrect; or 
(b) correct information is reported, but not correctly used. 


Category (a) consists of errors in the reporting of census-day address and fabrication in 
the PES interview, discussed in Sections 3 and 4, respectively. The present section discusses 
matching errors (category (b)) that occur even when the people are real and their census-day 
address is correctly reported. In other words, these are errors in matching due to processing 
mistakes. 

In our new PES design, matching takes two forms: automated batch matching and computer- 
assisted clerical matching. The status of “not enumerated” is assigned to a P-sample person 
when sufficient information for matching has been gathered and no matching case can be found 
in the census. Errors occur when there actually was insufficient information for matching but 
matching was attempted nonetheless, and also when the correct census questionnaires were 
searched but the match was not established, even though the person was in fact counted in 
the original enumeration. 

A P-sample person occasionally may be declared to match the wrong census person. This 
happens most often within families, where children’s names and ages may be similar, and in 
“ethnic” neighborhoods where certain names are unusually common. Normally, false matches 
are less common than false nonmatches because the matches can be reviewed easily by a clerical 
matching staff. False matches create a bias in the dual system estimator only when the P-sample 
person was actually not enumerated. 

A principal change in our PES design since 1980 that allows better control of matching error 
is the use of a common sample of blocks for both P and E samples. The block sample design 
permits a classification of all enumerated people (both P- and E-sample) into three categories: 


— counted in P sample, counted in E sample 
— counted in P sample, missing from E sample 
— missing from P sample, counted in E sample. 


This kind of organization or accounting, which was not possible with the 1980 design, imparts 
to the matching process a quality that resists matching error. For example, people with similar 
names in ethnic neighborhoods can be sorted out using all the information provided by a block 
sample. Address mix-ups in the census process are easier to handle with a block sample. The 
choice of census block as a sampling unit also reduces geographic coding error as compared 
to the 1980 PES, where the P sample was based on CPS clusters of four housing units and 1970 
Census geography. 

Matching is especially difficult for P-sample people who lived elsewhere on census day, i.e., 
movers. For movers, the census-day address reported in the P-sample interview must be assigned 
to the proper geographical area prior to matching. This assignment was problematic in the 1980 
PES and the new design does not necessarily solve the problem. The Census Bureau will, how- 
ever, be using a new, automated geographical system for the 1990 Census (see Marx and Saalfeld 
1988), and we are hopeful that this innovation will permit rapid and accurate geographic assign- 
ment for mover addresses. 

In the 1986 TARO, about 74 percent of the P-sample people were matched by the computer. 
Another 12 percent were declared “possible match” by the computer. A specially trained clerical 
staff reviewed all cases not designated as “match” by the computer, including all of the 
computer-designated “possible matches.” 
Table 2 
Results of Rematch Study: Sample (Weighted)ᵃ 


                                  Results of Rematching 
Results of 
Original Matching       Enumerated   Not Enumerated   Unresolved     Total 

Enumerated                16,623            18             55        16,696 
Not Enumerated                88         2,164             56         2,308 
Unresolved                    17             0            132           149 

Total                     16,728         2,182            243        19,153 


ᵃ Weighting is to P-sample totals. 


The results of the 1986 PES in Mississippi show that the success of the computer matching 
system is not limited to urban areas with house numbers, street names and well-defined 
geography. In the Mississippi test, addresses commonly consisted of a rural route and box 
number. Blocks were irregularly shaped with invisible boundaries such as an intermittent stream 
or county line. Still, the computer was able to match 68 percent of the cases. 

We have conducted two studies to evaluate the extent of matching error in TARO. In the 
first study, a subsample of 35 blocks was selected and rematched by professionals from head- 
quarters. The rematch was done independently of the original match, and then discrepancies 
between the match and rematch results were adjudicated. Because of this intensive approach 
to the rematch, we believe the rematch results represent true match status, while differences 
between the match and rematch results represent the bias in the original match results. Only 
nonmovers were considered in this study. Also, the study was confined only to within-block 
rematching, and thus did not formally measure any false nonmatches that may have occurred 
because the census enumeration was located outside the PES block. 

The results for the P sample are given in Table 2 in the form of a cross-tabulation of match 
statuses as assigned from the original TARO match and the rematch. 

We estimate there are about 88 false nonmatches and 18 false matches in the original TARO 
results, and that 111 = 55 + 56 cases originally matched or not matched should have been 
declared to have an indeterminate or unresolved match status. In the normal course of estima- 
tion, the unresolved would be treated by missing data procedures (Schenker 1988). The net 
result is that the observed match rate, i.e., the number matched divided by the number mat- 
ched plus not matched, is .879 in the original match and .885 in the rematch, and thus that 
the original match rate is biased downward by about 0.6 percent. 
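
The two observed match rates follow from the margins of the rematch cross-tabulation in Table 2. The sketch below recomputes them; the cell of 17 (originally unresolved, enumerated on rematch) and the grand total are inferred from the printed row and column totals, so they should be read as assumptions. The variable names are hypothetical.

```python
# Weighted counts: CROSSTAB[original status][rematch status].
CROSSTAB = {
    "enumerated":     {"enumerated": 16623, "not_enumerated": 18,   "unresolved": 55},
    "not_enumerated": {"enumerated": 88,    "not_enumerated": 2164, "unresolved": 56},
    "unresolved":     {"enumerated": 17,    "not_enumerated": 0,    "unresolved": 132},
}


def observed_match_rate(matched, not_matched):
    """Number matched divided by the number matched plus not matched."""
    return matched / (matched + not_matched)


# Row totals give the original match results.
orig_matched = sum(CROSSTAB["enumerated"].values())
orig_not_matched = sum(CROSSTAB["not_enumerated"].values())

# Column totals give the rematch results.
re_matched = sum(row["enumerated"] for row in CROSSTAB.values())
re_not_matched = sum(row["not_enumerated"] for row in CROSSTAB.values())

original_rate = observed_match_rate(orig_matched, orig_not_matched)
rematch_rate = observed_match_rate(re_matched, re_not_matched)
bias_points = 100 * (rematch_rate - original_rate)

print(f"original: {original_rate:.3f}, rematch: {rematch_rate:.3f}, "
      f"downward bias: {bias_points:.1f} percentage points")
```

Unresolved cases are excluded from both denominators, which matches the definition of the observed match rate given in the text.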

The second evaluation study looked at the extent of matching error for movers. Among 
the original “not matched,” there were 90 persons who reported moving between census day 
and the time of the PES. For movers, searching is done at the reported census-day address. 
As an evaluation of the accuracy of the matching process, we reworked all 90 nonmatched 
mover cases using more intensive procedures. Eleven new matches were discovered, and as a 
result, the observed match rate for in-scope movers increased by .058, from .661 to .719. 
Although the false nonmatch rate, 11/90 = .122, for movers is larger than we observed for 
nonmovers, the movers comprise a relatively small portion of the overall P sample. Correct- 
ing the 0.6 percent and 5.8 percent downward bias in match rate for nonmovers and movers 
has the overall effect of reducing the TARO undercount rate by 0.7 percent. 

These calculations ignore the possibility of further new matches that might have been 
observed had the rematch study extended beyond the bounds of the PES blocks (Thompson, 
Whitford and Stoudt 1987). Based on evidence from computer matching across the Los Angeles 
test site, however, we conclude that geographical assignment was accurate, and that the 
incremental effect of such additional matches could do no more than to reduce the estimated 
TARO undercount by a further 0.3 percent. 


3. REPORTING CENSUS-DAY ADDRESS 


In our new PES design, as in the 1980 design, we attempt to match the P-sample people 
to the census enumeration at the census-day address. To facilitate the matching, the P-sample 
interviewer must ask where each household member lived on census day. The interviewer then 
probes for other addresses where the persons may have lived, including such places as at college 
or university, on a military base or ship, or at a second home. If the census-day address is 
reported incorrectly in the P-sample interview, then we may falsely designate the household 
members as not enumerated in the census, thus biasing upwards the estimated undercount 
rate. 

To study address misreporting, we reinterviewed a subsample of the matched and unmatched 
cases after the original TARO estimates of undercount had been produced. This followup was 
six months after the initial PES interview and ten months after census day. Before presenting 
the results, we mention two limitations on this study. The first is the potential of greater recall 
error than in the original P-sample interview. Second, any trust created by the census adver- 
tising program may have faded, a potentially serious problem in an area with a large number 
of undocumented immigrants who fear all contacts with the government. 

Table 3 describes the composition of the subsample. In most cases, the PES household 
matches the census household completely (“whole-household matches”). In the category 
“partial-household matches,” some of the PES persons match the census, but others do not. 
The “whole-household nonmatch with conflicts” category constitutes what we call the 
“Emerson-Peterson” problem. The census enumerated the “Emersons” at a particular address 
and the E-sample followup confirmed the census enumeration as correct. However, the 
P-sample interview showed the “Petersons” as living at the address on census day. These facts 
are in conflict, and one possible explanation is that the Petersons misreported their 
census-day address. The “whole-household nonmatches without conflicts” category has no apparent 
contradictions; for example, the census missed the housing unit or listed it as vacant. 


Table 3 
Post-Production Followup Sample Sizes 


                                        Number of Households 
Status of Original Match           Total in P sample    Reinterviewed 

Whole-Household Match                    4,662                50 
Partial-Household Match                    609                50 
Whole-Household Nonmatch 
   with Conflicts                          160                64 
Whole-Household Nonmatch 
   without Conflicts                       357               109 
Table 4 


Outcome of Post-Production Followup (Persons), Unweighted 


                        Whole-Household Nonmatch     Partial-Household Match     Whole-Household 
                        Conflict     No Conflict     Nonmatched      Matched          Match 
Outcome                   #     %      #     %         #     %       #     %        #     % 

Address Confirmed        64    33    252    73        61    75     138    90      164    99 
New Address Given        32    17     46    13        13    16      15    10        1     1 
Possible Fabrication     70    36     23     7         2     2       0     0        0     0 
Noninterview             27    14     24     7         5     6       0     0        0     0 
Total                   193   100    345   100        81   100     153   100      165   100 


Note: # signifies number of people in the followup subsample. 
% signifies percent of column category. 


Table 4 gives the results for persons in the sample, with a separate breakout of initially 
matched versus nonmatched persons in partially matched households. As expected, the rates at 
which the address was confirmed vary greatly across strata. Virtually all addresses were con- 
firmed for the persons in the whole-household match category, while the lowest rate of con- 
firmation was for the whole-household nonmatch with conflicts category. New addresses were 
given by 13 to 17 percent of the nonmatched people across each of the three categories. 
Interestingly, new addresses were reported for ten percent of the matched people within par- 
tially matched households, not much less than for the nonmatched people within these 
households. The newly reported address is unlikely to be correct, unless identical errors were 
made in the original P sample and census interviews. This variable reporting reinforces our 
view that followup interviewing months after the original P-sample interview sometimes gives 
a different response (because of recall error and fear), but not necessarily a more accurate 
address. 

Evidence was gathered on 95 cases suggesting that they were possibly fabricated in the original 
P-sample interview. Most of these cases (70) came from the category of whole household non- 
matches with conflicts. This problem is discussed further in Section 4. In addition, there were 
cases where the reinterview was not complete or yielded insufficient information to classify 
individuals into one of the categories. Some of these, had they been correctly interviewed, may 
also have reported a new address. 

Weighting Table 4 to P-sample totals, we estimate that 3.1 percent of P-sample persons were 
erroneously reported as nonmovers in the original P-sample interview. For those who moved 
within the test site, we were able to search for a match at the new address, and we found that 
one third of those cases were enumerated in the Los Angeles test census. To assess the pro- 
bable effect of reporting errors, particularly as we view TARO as a test of a national PES, 
we assume that those people who reported addresses outside the site would have been 
enumerated at the same rate as those who reported addresses within. Thus, one-third of the 
3.1 percent would have been matched and classified as enumerated. Correcting for the reporting 
error results in a one percent reduction in the estimated undercount. 
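
The arithmetic behind this correction can be sketched as follows. This is only a minimal illustration in Python, assuming that the reduction in the undercount (in percentage points) is roughly the share of P-sample persons misreported as nonmovers multiplied by the share of them that would have matched; the variable names are hypothetical and are not taken from any PES processing system.

```python
# Rough sketch of the mover-misreporting correction described in the text.
# Assumption: the percentage-point effect on the undercount is approximately
# (share misreported as nonmovers) x (share enumerated at the new address).
misreported_nonmover_pct = 3.1   # percent of P-sample persons, from the text
enumerated_share = 1.0 / 3.0     # one third were found enumerated at the new address

reduction_pct = misreported_nonmover_pct * enumerated_share
print(f"Approximate reduction in estimated undercount: {reduction_pct:.1f} percentage points")
```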
4. FABRICATION IN THE PES INTERVIEW


In spite of all good efforts to train and control interviewers, a PES interviewer may occa- 
sionally fabricate a household in lieu of conducting a proper interview. Fabricated cases will 
not match to the census. The estimated undercount rate will be inflated to the extent that 
fabricated cases substitute for people at the address who were actually enumerated. 

Our new PES design seeks to control the fabrication rate to low levels. The sample design 
allows for frequent quality control checks using re-interviews of the interviewers’ work. Samples 
are checked for each interviewer’s work from each block several times per week. This close review 
was not possible in the 1980 PES, where interview assignments were not as highly clustered. 
We have also improved the training and supervision of the interviewing since 1980. Feedback 
on performance and retraining is now available to interviewers so that errors will not be repeated. 

Two studies shed light on the extent of fabrication in the 1986 TARO. First, extensive quality 
control checks were performed during data collection for the P sample, both for address listing 
and for interviewing. The main conclusion from the quality control results is that there was 
evidence of only a small amount of fabrication. A total of 2070 P-sample interviews were 
checked by quality control clerks a few days after the original interview to verify the household 
composition (roster check). Of these, 59 interviews failed the roster check. These cases were 
examined in detail to determine how many of them were examples of fabrication. This was 
determined by whether each person in the household, as reported by the original interviewer 
(not the quality control clerk), matched to the census, which implies that the original inter- 
viewer collected valid data for that person. A clone fabrication in the census would be needed 
to invalidate this assumption. Only 13 of the 59 cases were identified as possible fabrications 
in that they had, for example, no persons from the original PES roster matching the census. 
Hence, the estimated fabrication rate for the quality control check is 0.6 percent. 
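
The quality control rate can be reproduced directly from the counts quoted above. The snippet below is a minimal sketch; the variable names are illustrative only.

```python
# Counts from the quality control roster check described in the text.
interviews_checked = 2070      # P-sample interviews verified by quality control clerks
failed_roster_check = 59       # interviews that failed the roster check
possible_fabrications = 13     # failed cases judged to be possible fabrications

roster_failure_rate = failed_roster_check / interviews_checked
fabrication_rate = possible_fabrications / interviews_checked

print(f"Roster check failure rate: {100 * roster_failure_rate:.1f} percent")
print(f"Estimated fabrication rate: {100 * fabrication_rate:.1f} percent")
```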

The second source of data on the extent of fabrication is the post-production followup 
described in Section 3. From the data in Table 4, we estimate that about 1.2 percent of the 
P-sample people may have been obtained in fabricated interviews. This fabrication rate is about 
twice as large as provided by the quality control roster check. We believe much of the difference 
is attributable to one bad interviewer whose work was discovered in the followup interview, 
but evidently escaped detection by the quality control system. Another part of the difference 
may be that the followup exaggerates the level of fabrication; that is, landlords and other 
respondents deny the existence of people who occupy illegally converted housing units or who 
are present in the country without documentation. 

To calculate an upper bound on the effect of fabrication in TARO, we assume the higher 
fabrication rate, .012, and we assume that if proper interviews had been conducted, the resulting 
P-sample people would match to the census at the same rate as achieved for the nonfabricated 
cases, or about .88. This leads to a corrected undercount of about 7.9 percent, about 1.1 per- 
cent lower than the original undercount of 9 percent. If we assume the lower fabrication rate, 
.006, then by similar calculations, the corrected undercount is 8.4 percent, or about .6 percent 
lower than the original TARO figure. In the summary of TARO errors presented in Table 1, 
we specified a value of 1 percent, which is about equal to the effect implied by the upper bound. 
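
One way to arrive at figures of this size is sketched below. It is not necessarily the calculation used for TARO. The sketch assumes that (a) every fabricated P-sample person is a nonmatch, (b) the dual system estimate is inversely proportional to the P-sample match rate, and (c) properly conducted interviews would match at the rate of the nonfabricated cases (.88). The function name and its parameters are hypothetical.

```python
def corrected_undercount(undercount, fabrication_rate, nonfab_match_rate=0.88):
    """Undercount rate after replacing fabricated cases with proper interviews.

    Simplifying assumptions (ours): fabricated persons never match, and the
    dual system estimate scales with 1 / (P-sample match rate).
    """
    # Observed match rate is diluted by fabricated cases, which do not match.
    observed_match_rate = nonfab_match_rate * (1.0 - fabrication_rate)
    # With proper interviews, all cases match at the nonfabricated rate.
    corrected_match_rate = nonfab_match_rate
    # The estimated population shrinks by this factor when the match rate rises.
    dse_scale = observed_match_rate / corrected_match_rate
    return 1.0 - (1.0 - undercount) / dse_scale


upper_bound_case = corrected_undercount(0.09, 0.012)
lower_rate_case = corrected_undercount(0.09, 0.006)
print(f"Fabrication rate .012: corrected undercount {100 * upper_bound_case:.2f} percent")
print(f"Fabrication rate .006: corrected undercount {100 * lower_rate_case:.2f} percent")
```

Under these assumptions the match rate cancels, so the correction depends only on the fabrication rate. The results agree with the figures in the text to within rounding.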


5. MISSING DATA 


In order to measure small coverage errors accurately, the PES data set should be as com- 
plete as possible, without a large percentage of missing data. Unfortunately, there was a very 
large amount of missing data in the 1980 study (Fay et al. 1988). A number of changes in the
PES design should now lead to lower levels of missingness. 
Table 5
PES Missing-Data Rates (%)

                                               1980 PEP           1986
Source                                      April    August       TARO

P Sample
  Noninterview (Household)                    4.4       5.3        0.5
  Unresolved enumeration status (Person)      4.0       4.4        0.8
  Total                                       8.4       9.7        1.3
  Proxy interview (Household)                  ^a        ^a    [illegible]

E Sample
  Noninterview (Household)                    1.1       1.1         NA
  Geocoding indeterminate (Household)         1.6       1.6         NA
  Unresolved enumeration status (Person)      2.0       2.0        4.7
  Total                                       4.7       4.7        4.7


^a Percent unknown.
NOTE: NA signifies ‘‘not applicable.’’


First, because of the tight time schedule for CPS interviewing, the initial P-sample inter- 
views in 1980 were conducted during a one-week period. For the new PES, a three-week inter- 
viewing period is used, with yet another week if special problems arise. The longer interviewing 
period decreases the household noninterview rate. Another change that reduces the household 
noninterview rate is the sample of blocks (rather than list-sample clusters of four housing units 
as in the CPS). This sample allows the interviewer to visit a housing unit several times (per- 
haps between visits to the other housing units in the block) without extreme travel costs. 

Incomplete followup interviews caused a large portion of the missing P-sample enumera- 
tion statuses in the 1980 PES (2.6 percent for April and 2.8 percent for August). We are attempt- 
ing to diminish this problem by collecting the information needed to declare cases as either 
enumerated or missed during the initial interview, thereby eliminating the need for followup 
in most cases. Additionally, improvements in the timing and quality of matching, because of 
the new automated matcher, will reduce the number of cases requiring followup. 

In the new PES design, the P and E samples overlap, and thus most of the information needed 
to determine E-sample enumeration statuses is gathered early, during initial P-sample inter- 
viewing. The use of a block sample, along with improved census geography, also helps reduce 
the proportion of E-sample cases for which correctness of census geocoding cannot be deter- 
mined. Finally, improvements have been made in the treatment of missing data (Schenker 1988). 

As can be seen in Table 5, the missing-data rates for the P sample in TARO are much lower 
than those for the 1980 PES. The E-sample total missing-data rate for TARO is equal to that 
for the 1980 PES, but this was due to an operational error in TARO, and we expect reduc- 
tions in missing data similar to those for the P sample in the future. 

Even though TARO achieved low levels of missing data, it is important to examine what 
effect the missing data has on the estimated undercount rates. To answer this question, we pro- 
duced several sets of undercount estimates for TARO derived using alternative treatments of 
missing data, P-sample proxy interviews, P-sample movers, and certain E-sample unresolved 
cases. See Schenker (1988) for a detailed description of the alternative estimates, which ranged 
from a low of 7.8 percent to a high of 9.4 percent. Two of the alternative treatments considered
in Schenker (1988) deal with problems discussed elsewhere in our paper; they are the treatment
of movers within the test site (Sections 2 and 3) and E-sample resolved cases that may
have been fictitious enumerations (Section 6). The effects of these treatments are attributed
in Table 1 to sources of error other than missing data, and are the main reason for the difference
between the TARO undercount estimate of 9 percent and the lowest alternative estimate
of 7.8 percent. When the other treatments discussed in Schenker (1988) are considered, the
change in the estimated undercount ranges from -0.3 percent to 0.3 percent. These changes
are quite small and it is uncertain in which direction the true effect lies. Hence, we have listed
a mean effect of 0.0 percent in Table 1.


6. ERROR IN MEASURING THE ERRONEOUS ENUMERATIONS 


To estimate net coverage error, it is necessary to estimate the number of erroneous enumera- 
tions (EE) contained within the original census enumeration. EE includes the following distinct 
categories: 

(i) fabrication in the census, where the census enumerator or respondent creates fictitious 
people in lieu of conducting a proper interview; 

(ii) census duplicates; 

(iii) persons born after census-day and persons who died before census-day; and 

(iv) persons enumerated in the census with such sparse or incomplete information as to
render them unmatchable to the PES.

All of these categories are estimated by way of the E sample. In addition, certain census 
geographic coding errors are treated as erroneous enumerations; this problem is part of the 
balancing issue discussed in Section 7. 

In the 1980 PES, the E sample was a separate and independent sample of 110,000 census 
household enumerations. Interviewers revisited the housing units 8 months after census day 
to verify that the census enumerations were either correct or erroneous. Also, the housing unit 
was located on a map to see if it was assigned to the correct census geography, and clerks 
searched the census records to identify duplicates. 

We have instituted two important changes in the new E-sample design. First, as already 
discussed, both the E and P samples will now be based on the same sample of blocks. We have 
found that overlapping P and E samples reduces geographic assignment errors. Second, most 
E-sample data will be collected in July, just three months after census-day. The procedures are 
such that most E-sample people are automatically designated correctly enumerated if they are 
counted in the P sample in July and are subsequently matched correctly to the person’s E-sample 
enumeration. Unmatched E-sample cases are tagged for a followup interview, occurring only 
6 months after census day. The earlier reporting in this new design lowers the missing data rates, 
reduces reliance upon proxy respondents, and improves the quality of the collected data. 

There are four main components of error in the measurement of EE: 

(i) response errors in the E-sample interview (this is the P-sample interview for most cases 
and the followup interview for all other cases), or mis-coding of responses by the pro- 
cessing staff; 

(ii) error committed by an interviewer or by staff in assigning the correct geographic code 
to an E-sample person; 

(iii) error in conducting the search for duplicates; and 

(iv) mistakes made in classifying an E-sample case as having insufficient information for
matching.
In addition, there are errors due to non-response in the E-sample interview, as discussed in
Section 5, and sampling error, as discussed in Section 9. 

Response errors often relate to the assignment of the status of ‘‘fictitious’’ to an E-sample 
person. The E-sample interviewer sometimes finds that the current resident of a unit (or another 
eligible respondent) does not know the people listed in the census. Usually, this is because the 
current resident moved in after the census and simply does not know who was living there at 
the time of the census. These E-sample cases should be designated as nonresponse. However, 
if the census enumerations were fabricated, no respondent will know the ‘‘people’’ reported 
in the census. 

In experimenting with the new design in the TARO, the E-sample interviewers were instructed 
to determine whether the E-sample enumerations were fictitious and to record the basis for 
their decisions. Initially, the clerks required very strong evidence before designating an E-sample 
person as fictitious. It was this data that was used in preparing the first TARO estimates of 
total population and percent undercount. We realized that the rules for coding were being inter- 
preted too strictly, and later, we had professionals review all E-sample cases coded as ‘‘noninter- 
view, respondent does not know’’ to determine if any should have been coded as ‘‘fictitious’’. 
Out of 257 such E-sample cases, 118 were coded by the professionals as ‘‘fictitious.’’ The cor- 
rected information was used to create some alternative TARO estimates (Schenker 1988). 

Geographic assignment of census returns was generally thought to be very good in the Los 
Angeles test site, which was a long-established neighborhood with large well-defined blocks. 
We have not produced formal measures of the effects of geographic misassignment on the 
estimated EE, but we believe such error is negligible. In other areas of the U.S., however, the 
errors could be nonnegligible either because of poor maps, poor or incomplete addresses, or 
confusion about geographic locations created by new construction. 

For example, in contrast with Los Angeles, geographic assignment was a problem in the 
1986 Mississippi census returns. There we discovered 2.22 percent of the E sample was 
duplicated. Of the duplicate cases, 35 percent were located outside the sample block. Although 
we were able to find many duplicates outside the sample block, we are not convinced we found 
all of them. This is because searching for duplicates was not designed as a separate activity. 
We only identified duplicates in the course of other PES operations, and thus probably missed 
many of them. In the next PES, we will implement a separate activity to search for duplicates. 

The census sometimes enumerates people with such sparse information that even if they were 
correctly interviewed in the P sample, a match to the E sample would not be possible. To com- 
pensate for this problem, such E-sample cases should be included in EE so as to estimate the 
total population properly. This problem is similar to that of geographic balancing discussed 
in Section 7. The separate E and P samples in the 1980 PES made it very difficult to do this
consistently; similar cases were classified as ‘‘unmatchable’’ in the E sample and ‘‘matchable’’
in the P sample, thus creating a bias in the dual system estimator. Because the new PES design
uses overlapping P and E samples, we ensure that identical rules are applied, thus eliminating
the bias.

In another evaluation of the TARO, and as part of the rematch study discussed earlier (see 
Section 2), the E-sample cases in a subsample of 35 blocks were reprocessed by professionals 
from headquarters. As in Section 2, the rematch was independent of the original work, with 
subsequent adjudication of any discrepancies. Thus, we believe the rematch represents the best 
possible determination of the true enumeration statuses of the E-sample people, while dif- 
ferences between the original work and the rematch may be regarded as a measure of bias due 
to error in the original work. 
Table 6
Results of Rematch Study: E Sample (Weighted)^a

                                      Results of Rematching
                            Correct        Erroneous
Original Results            Enumeration    Enumeration    Unresolved     Total

Correct Enumeration          19,153             28             88       19,269
Erroneous Enumeration            41            283              1          325
Unresolved                      140            100            223          463

Total                        19,334            411            312       20,057


^a Weighting is to E-sample totals.


Results are presented in Table 6. Notice that most of the changes involve cases originally 
classified as ‘‘unresolved.’’ Many of these cases were those discussed earlier, requiring a sub- 
jective decision between ‘‘fictitious’’ and ‘‘nonresponse.’’ Based on these data, we believe that 
better clerical procedures are needed for coding E-sample cases as fictitious. We are presently 
working to implement improved procedures in the Census Bureau’s next PES, to be done in 
conjunction with a 1988 dress rehearsal of the 1990 Census. 

From the rematch study, we believe the original rate of EE,

\[ \frac{325}{325 + 19{,}269} = .016, \]

should be increased to about

\[ \frac{411}{411 + 19{,}334} = .021. \]


This implies the original TARO undercount should be reduced by about 0.5 percent. The cor- 
rected undercount is thus about 8.5 percent. 
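
The two EE rates can be recomputed from the marginal totals of Table 6. The sketch below is illustrative only, and the helper name `ee_rate` is hypothetical. Unresolved cases are excluded from the denominator, as in the ratios above.

```python
def ee_rate(erroneous, correct):
    """Share of resolved E-sample cases classified as erroneous enumerations."""
    return erroneous / (erroneous + correct)


# Row totals (original processing) and column totals (rematch) from Table 6.
original_rate = ee_rate(erroneous=325, correct=19_269)
rematch_rate = ee_rate(erroneous=411, correct=19_334)

print(f"Original EE rate: {original_rate:.4f}")
print(f"Rematch EE rate:  {rematch_rate:.4f}")
print(f"Difference:       {100 * (rematch_rate - original_rate):.2f} percentage points")
```

The unrounded difference is a little over 0.4 percentage points, in line with the reduction of about 0.5 percent obtained from the rounded rates.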


7. BALANCING GROSS OVERCOUNTS AND UNDERCOUNTS 


In order to estimate net undercoverage, the methods and concepts used to measure gross 
overcount must be consistent with those used to measure gross undercount. We refer to this 
requirement as ‘‘balancing.’’ We proceed to give an elementary description of how the PES 
achieves balance. 

One way to view this issue is to consider the dual system estimator in the form

\[ \hat{N}_{++} = \frac{\hat{N}_{1+}\,\hat{N}_{+1}}{\hat{N}_{11}}, \]

where $\hat{N}_{1+}$ is the estimated number of correct census enumerations, $\hat{N}_{11} = M$ is
the weighted number of matched P-sample people, and $\hat{N}_{+1}$ is
the weighted number of people in the P sample. All notation is defined in Diffendal (1988).


Since we cannot search all census questionnaires, the observed number matched, M, will 
be lower than the true number in both systems. To make costs manageable, matching for a 
given case is restricted to a ‘‘search area’’, typically the sample block and one or two rings of 
surrounding blocks. 

As a consequence, the term $\hat{N}_{11}$ estimates $k \hat{N}^{*}_{11}$, where $0 < k < 1$ is the conditional
probability that a census enumerated individual is counted in the correct search area and $\hat{N}^{*}_{11}$
is the PES estimator of $N_{11}$ that would obtain if it were feasible to conduct the search for
matches over the entire population.

To construct a consistent estimator of population size, we must reduce the number counted
in the census by the factor $k$. Because the E-sample search for erroneous enumerations, e.g.,
duplicates, extends over the search area and we treat as erroneous all enumerations that should
not be included in the search area, the term $\hat{N}_{1+}$ estimates $k \hat{N}^{*}_{1+}$, where $\hat{N}^{*}_{1+}$ is the
estimator of $N_{1+}$ that would obtain if it were feasible to conduct the search for erroneous
enumerations over the entire population.

Assuming consistent search areas, the DSE becomes a consistent estimator of $N_{++}$. Note
that in this model of the balancing process, we do not estimate the probability $k$, but instead
rely on consistent search areas to eliminate it from the DSE.
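
The cancellation of k can be checked numerically. The sketch below assumes the usual dual system form (correct census enumerations times the P-sample total, divided by the number of matches). The "full search" inputs are hypothetical illustrative values, and the function name `dse` is ours.

```python
import math


def dse(correct_enumerations, p_sample_total, matches):
    """Dual system estimate of total population."""
    return correct_enumerations * p_sample_total / matches


# Hypothetical values that would be obtained with an unrestricted search.
full_correct, p_total, full_matches = 343_667.0, 336_707.0, 298_204.0
full_search = dse(full_correct, p_total, full_matches)

# Balanced design: the same search area scales both matches and correct
# enumerations by k, so k drops out of the estimate.
balanced = [dse(k * full_correct, p_total, k * full_matches) for k in (0.99, 0.95, 0.90)]

# Unbalanced design: only the matches are reduced, inflating the estimate.
unbalanced = dse(full_correct, p_total, 0.95 * full_matches)

print(f"Full search: {full_search:,.0f}")
print("Balanced:   ", ", ".join(f"{value:,.0f}" for value in balanced))
print(f"Unbalanced:  {unbalanced:,.0f}")
```

The unbalanced case illustrates the kind of bias that arises when the P-sample and E-sample search areas are inconsistent.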

Balancing the P sample and the E sample in the 1980 PES was impossible because the samples 
did not overlap. The CPS (or P-sample) addresses were coded to census geography. The search 
area was to have been limited to a close neighborhood of the CPS address, but because the 
CPS addresses were based on 1970 Census geography, they could not be easily assigned 1980 
Census geographic codes, and searching extended over a wide area. As the search area expanded 
for the P sample, the E-sample search area should also have expanded. We believe inconsistencies
arose between E- and P-sample search areas, thus creating a bias in the DSE.

In TARO, we performed the two-way match between the P- and E-sample persons within
the selected blocks. The geography and search areas were consistent, well-defined, and well- 
controlled during computer and clerical matching. As a consequence, the problem of balancing 
did not introduce any important bias into the Los Angeles results. 


8. CORRELATION BIAS 


For the dual system estimator to be a consistent estimator of the true population size $N_{++}$,
two independence assumptions are needed:


(i) causal independence,
(ii) heterogeneous independence.


In addition, autonomous independence is often assumed, but failure of this assumption is
known to impart little or no bias to the estimate of total population (Wolter 1986b; Cowan
and Malec 1986).

Causal independence fails when an individual’s capture history in the census alters the probabilities
of capture in the PES. The estimator $\hat{N}_{++}$ is downward biased when the odds of
capture in the PES are increased as a result of capture in the census, and is upward biased when
the odds of capture in the PES are reduced as a result of capture in the census.

An important bias may exist in the April 1980 PES data because of a failure of causal inde- 
pendence. The failure occurred because respondents may have mistaken the April or March 
CPS enumerations for the census enumeration. 
Table 7
Undercounts (%) for Black and Total Population in the 1980, 1960 and 1950
U.S. Censuses, and Differential Undercount Rates

Source            Black           Total           Difference
1950
  PES             [illegible]     1.4             1.8
  DA              9.6             4.4             5.2
1960
  PES             3.8             1.9             1.9
  DA              8.3             3.3             5.0
1980
  PES^a
    Low           [illegible]     -1.0            [illegible]
    Middle        6.9             1.4             5.5
    High          [illegible]     [illegible]     3.6
  DA              5.9             1.4             4.5


^a The 1980 PES produced 12 sets of estimates. The three presented here are selected from
the highest, middle and lowest set as measured by estimated total undercount.


Heterogeneous independence fails when census capture probabilities are different from one 
individual to another. The resulting bias (called heterogeneity bias or correlation bias) is gen- 
erally thought to be a downward bias because individuals with a high probability of capture 
in the census also tend to have a high probability of capture in the PES and, conversely, 
individuals with a low probability of capture in the census also tend to have a low probability 
of capture in the PES. 

Sekar and Deming (1949) suggested post-stratification to control heterogeneity bias. In prac- 
tical applications, it is unlikely that this technique is fully effective; there is inevitably some 
residual heterogeneity of capture probabilities within post-strata. 
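
A small numerical example shows the direction of heterogeneity bias and the effect of post-stratification. It is a sketch built on invented values: the two groups and their capture probabilities are hypothetical, the two systems are assumed to be causally independent within each group, and expected counts are used in place of sample data.

```python
# (group size, census capture probability, PES capture probability)
# All values are hypothetical and chosen only for illustration.
groups = [
    (900, 0.95, 0.95),   # easy-to-count group
    (100, 0.50, 0.50),   # hard-to-count group
]

true_total = sum(size for size, _, _ in groups)

# Expected counts when the groups are pooled before estimation.
in_both = sum(size * p_census * p_pes for size, p_census, p_pes in groups)
in_census = sum(size * p_census for size, p_census, _ in groups)
in_pes = sum(size * p_pes for size, _, p_pes in groups)
pooled_dse = in_census * in_pes / in_both

# Dual system estimate computed separately within each group, then summed.
stratified_dse = sum(
    (size * p_census) * (size * p_pes) / (size * p_census * p_pes)
    for size, p_census, p_pes in groups
)

print(f"True total:     {true_total}")
print(f"Pooled DSE:     {pooled_dse:.1f}")
print(f"Stratified DSE: {stratified_dse:.1f}")
```

The pooled estimate falls short of the true total, while the stratified estimate recovers it because capture probabilities are homogeneous within each group in this constructed example. In practice some heterogeneity remains within post-strata.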

In the dual-system model, the number of people missed by both systems, $N_{22}$, is estimated by

\[ \hat{N}_{22} = \frac{\hat{N}_{12}\,\hat{N}_{21}}{\hat{N}_{11}}, \]

as in Diffendal (1988), equation (2). Because the dual system estimator may be expressed in
the form

\[ \hat{N}_{++} = \hat{N}_{11} + \hat{N}_{12} + \hat{N}_{21} + \hat{N}_{22}, \]

and because $\hat{N}_{11}$, $\hat{N}_{12}$, and $\hat{N}_{21}$ are direct design-based estimators, any bias due to failure of
the independence assumptions arises solely in $\hat{N}_{22}$ as an estimator of $N_{22}$.

We can study the correlation bias in 1980 and previous censuses by comparing $\hat{N}_{++}$ to
independent demographic analysis (DA) estimates of total population. Table 7 presents relevant 
data from recent censuses. If one treats demographic analysis estimates as a standard, these 
comparisons display total bias in the dual system estimator, including both correlation bias 
and other sources of error. We believe that the downward bias shown in these estimates is largely 
attributable to correlation bias. The 1950 PES gave severe underestimates of the population 
size, of the percent undercount, and of the differential undercount, presumably because of 
both causal and heterogeneity bias. Note, however, that if 1950 PES data had been used to 
correct the 1950 census, the differential undercount would have been reduced from 5.2 per- 
centage points to approximately 3.4 percentage points. 
The 1960 PES gave similar underestimates of population size, of the percent undercount
and of the differential undercount, again presumably because of correlation bias. If the 1960 
PES data had been used to correct the 1960 census, the differential undercount would have 
been reduced from 5.0 to approximately 3.1 percent. 

No PES was conducted in 1970. The 1980 PES produced 12 sets of estimated undercounts 
based on the April and August results and on different sets of assumptions. The DA under- 
count rates are approximately in the middle of the 12 PES undercount rates. Correlation bias 
is not as evident here as in 1950 or 1960, largely because of improvements that were made in 
1980 to reduce positive causal dependence. We believe the heterogeneity bias is still present 
but is obscured by other PES errors and by bias due to negative causal dependence. 

In the new PES design, we attempt to control the bias due to causal effects by scheduling 
the PES enumeration after most major census field activities. This approach, contrary to that 
of the April 1980 PES, will promote causal independence between the census and PES enumera- 
tions as much as possible. Further, we are now using field office procedures that will promote 
causal independence, such as assigning PES interviewers to different areas than they worked 
(if they worked) in the original census enumeration. 

It will be difficult to eliminate the correlation bias due to heterogeneity in future PES’s. 
The only possible avenues include more effective post-stratification and combining the PES 
and DA data in some way, possibly by controlling for DA sex ratios. See Wolter (1986c) and 
Choi, Steel and Skinner (1988). We have done some experimentation this decade with alter- 
native post-stratification schemes including using variables such as owner/renter status, census 
mail-back rate, and marital status. These approaches show some promise. See Diffendal (1988). 

TARO yielded observed differential undercounts consistent with expected differentials. In 
the U.S., census coverage is normally lower for males than females. This result has been con- 
sistently observed from the results of demographic analysis. The TARO sex ratios (males per 
100 females) are higher than the census ratios for Hispanics and for people who were neither 
Hispanic nor Asian. The TARO sex ratios are much higher than census sex ratios (1.1 to 3.4
more males per 100 females) for the 30-44 year age group. This outcome is consistent with the 1980
national results from demographic analysis. Thus, we believe that the TARO sex ratios are 
closer to the true sex-ratios, and although correlation bias limits the gain, the PES is still able 
to measure the differential undercount. 

Table 8 presents the two-way table of data for the 1986 TARO, with no post-stratification. 
The estimate of the number missed by both systems,

\[ \hat{N}_{22} = 5{,}870, \]

is approximately the same order of magnitude as census substitutions (5,259) and erroneous
enumerations (6,426). Approximately one-eighth of the estimated census misses, $\hat{N}_{21} + \hat{N}_{22} = 44{,}373$,
are attributable to the (2,2) cell. Thus, most of the measured undercount arises from
direct survey estimation, not from the dual-system model.
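
The entries of Table 8 can be reproduced from its three directly estimated cells. The following is a minimal sketch; the variable names are ours.

```python
# Directly estimated cells of Table 8 (census status by PES status).
n_11 = 298_204   # counted in census, counted in PES
n_12 = 45_463    # counted in census, missed by PES
n_21 = 38_503    # missed by census, counted in PES

# Model-based estimate of the number missed by both systems.
n_22 = n_12 * n_21 / n_11

# Dual system estimate of total population, in its two equivalent forms.
n_total = n_11 + n_12 + n_21 + n_22
n_total_product_form = (n_11 + n_12) * (n_11 + n_21) / n_11

census_misses = n_21 + n_22
share_from_model = n_22 / census_misses

print(f"Missed by both systems: {n_22:,.0f}")
print(f"Dual system estimate:   {n_total:,.0f}")
print(f"Share of census misses in the (2,2) cell: {share_from_model:.3f}")
```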

To illustrate the effect of correlation bias, consider doubling the size of the (2,2) cell. This 
increases the estimated undercount rate by about 1.4 percent. Based upon analysis of the 1980 
PES, Ericksen and Kadane (1985) suggest multiplying the (2,2) cell by 2.7, thus increasing the 
estimated undercount by 2.3 percent. 
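
These sensitivity figures can be checked with a short calculation. The sketch assumes that the census count equals correct enumerations plus substitutions plus erroneous enumerations, as in the footnote to Table 8, and that the undercount rate is one minus the ratio of the census count to the dual system estimate. The function name is hypothetical, and the base rate computed from the unstratified table need not equal the published TARO figure.

```python
# Cells of Table 8 and the census components quoted in the text.
n_11, n_12, n_21 = 298_204, 45_463, 38_503
substitutions, erroneous_enumerations = 5_259, 6_426

n_22 = n_12 * n_21 / n_11
census_count = (n_11 + n_12) + substitutions + erroneous_enumerations


def undercount_rate(cell_22_multiplier=1.0):
    """Undercount rate when the (2,2) cell is inflated by the given factor."""
    population = n_11 + n_12 + n_21 + cell_22_multiplier * n_22
    return 1.0 - census_count / population


base = undercount_rate()
increase_doubling = undercount_rate(2.0) - base
increase_2_7 = undercount_rate(2.7) - base

print(f"Increase from doubling the (2,2) cell: {100 * increase_doubling:.1f} points")
print(f"Increase from multiplying by 2.7:      {100 * increase_2_7:.1f} points")
```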

We have other information that sheds light upon the problem of correlation bias. Three 
anthropologists worked for the Census Bureau as participant or systematic observers in the 
Los Angeles test. Their observations do not provide direct measurements of correlation bias, 
but rather they provide insights into the degree to which the census and PES are missing the 
Table 8
Dual-System Estimates for 1986 Los Angeles Test Census

                                             PES
                                 Counted     Missed      Total
Correct Census      Counted      298,204     45,463    343,667
Enumerations^a      Missed        38,503      5,870     44,373
                    Total        336,707     51,333    388,040


^a Correct Census Enumerations = Total Census Enumerations - Substitutions - Erroneous
Enumerations.


same kinds of people. The reports suggest that there are people with very low capture probabilities
who tend to be missed by both the census and the PES, and thus that an important
downward bias may be present in TARO data. See Hainer et al. (1988) and Hines (1988).

Given the data available, we have no exact means of assessing the level of correlation bias 
in the TARO data. Nevertheless, based upon the work just cited, we speculate that the TARO 
undercount rate may be too small by 2.3 percent or more. 


9. RANDOM ERROR 


Sampling error affects the estimates of the number of matches, the number of erroneous 
enumerations, and the P-sample totals. The census count and the number of substituted census 
people are based upon the 100 percent census enumeration, and as such are not contaminated 
by sampling error. The estimated standard deviation for the undercount rate is 0.007. So a
95 percent normal-theory confidence interval for the undercount rate is .09 ± 2(.007) =
(.076, .104).
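
The interval can be reproduced as follows. This is a minimal sketch using the rounded multiplier of 2 that appears in the text rather than 1.96; the variable names are illustrative.

```python
undercount_estimate = 0.09   # estimated undercount rate
standard_error = 0.007       # estimated standard deviation of the rate
multiplier = 2               # rounded normal-theory multiplier used in the text

half_width = multiplier * standard_error
lower = undercount_estimate - half_width
upper = undercount_estimate + half_width

print(f"Approximate 95 percent interval: ({lower:.3f}, {upper:.3f})")
```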

Diffendal (1988) presents estimated standard errors for the TARO adjustment factors
defined by $Y = \hat{N}_{++}/\mathrm{CEN}$ and uses a components-of-variance model to smooth the Y, thus
reducing the effects of sampling error. In most cases, the smoothing substantially reduced the 
estimated standard errors, particularly for domains. We believe such smoothing can be used 
profitably in future PES’s. 


10. CONCLUSION 


After the 1980 Census, the Census Bureau reviewed its coverage measurement program and 
identified the program’s weaknesses. We instituted a research program and a new coverage 
measurement design aimed at reducing the weaknesses. We have completed major tests of the 
new PES design this decade and have demonstrated substantial improvements over the 1980 
PES.

In this article, we reviewed the results of our research program as reflected in the 1986 TARO. 
There may never be a perfect PES. However, none of the weaknesses or errors in the new design 
are so large as to invalidate the PES results. For reasons stated in Section 1, we believe the 
joint effect of the errors in the coverage measurement in TARO is smaller than the error in 
the original enumeration in Los Angeles. 
One of the main benefits of the TARO is that it enables us to identify new questions and
minor unresolved problems that warrant further research. For example, the initial PES inter- 
view attempted to gather the information needed to declare a P-sample person as missed in 
the census. We are now refining the questionnaire design, including additional screening 
questions to identify movers more accurately. In future PES’s, we will also conduct followup 
interviews for most movers and for nonmover households in the P sample suspected of having 
misreported mover status. In this way, we believe mover misreporting can be kept to a 
minimum. 

The quality control procedures that are intended to detect and correct fabrication in the 
PES must continue to be improved and tested. In addition to verifying names on the PES roster, 
other items shall be verified as part of the quality control check. This should detect any partial 
fabrication that occurs by obtaining names from mailboxes or landlords, and fabricating the 
characteristics. We are revising the PES followup forms in order to facilitate the identification 
of fictitious people. 

Our goal for future PES’s is to minimize missing data, especially through minimizing the 
need for followup. However, as more cases are sent to followup, the proportion of failed 
followup cases will increase. Research is needed on the proper treatment of these cases. 

Notwithstanding the good results from TARO, one should exercise appropriate caution
before drawing the conclusion that the 1990 PES results will be closer to the truth than will 
the 1990 original enumeration. The actual level of net undercount in the Los Angeles test was 
high compared to what would be expected in a national census. Will the size of the errors in 
a national PES be small enough to produce more accurate population estimates? 

We believe that the 1990 Census will contain areas with large undercounts and perhaps large 
overcounts, even if there is a small net national undercount. Thus, the PES should produce 
the more accurate population estimates for the areas most difficult to count. Through further 
polishing of the new PES during the last two years of this decade, it may be possible to produce 
more accurate population estimates for other, less-difficult-to-count areas too. 

We also believe that the errors in the PES will decrease as the undercount decreases. Stable 
areas with good maps, well-defined addresses, few movers and cooperative respondents will 
be relatively easy for both the census and the PES. Residual processing errors may produce 
a threshold of accuracy beyond which the PES may not go, regardless of the true net under- 
count. We will not know for sure until the 1990 PES is executed. This situation may lead the 
PES estimates to be more accurate than original census estimates for some areas, with equal 
or nearly equal accuracy for most other areas. Statistical theory should provide a means to 
produce a best estimate by combining the results of the original enumeration and the PES.


Modeling Matching Error and its Effect on Estimates
of Census Coverage Error


PAUL P. BIEMER¹


ABSTRACT 


Dual system estimators of census undercount rely heavily on the assumption that persons in the evalua- 
tion survey can be accurately linked to the same persons in the census. Mismatches and erroneous non- 
matches, which are unavoidable, reduce the accuracy of the estimators. Studies have shown that the extent 
of the error can be so large relative to the size of census coverage error as to render the estimate unusable. 
In this paper, we propose a model for investigating the effect of matching error on the estimators of census 
undercount and illustrate its use for the 1990 census undercount evaluation program. The mean square 
error of the dual system estimator is derived under the proposed model and the components of MSE arising 
from matching error are defined and explained. Under the assumed model, the effect of matching error 
on the MSE of the estimator of census undercount is investigated. Finally, a methodology for employing 
the model for the optimal design of matching error evaluation studies will be illustrated and the form 
of the estimators will be given. 


KEY WORDS: Undercount; Dual system estimation; Capture-recapture; Nonsampling error; Processing 
error. 


1. INTRODUCTION 


The use of capture-recapture methods for census evaluation and the evaluation of birth- 
death registration was first suggested by Sekar and Deming (1949). For estimating census cov- 
erage error, the method involves matching persons from a sample survey of the population 
to the census in order to determine the number of individuals which were enumerated in both 
the sample survey and the census. There are a number of difficulties which may occur in the 
capture-recapture method to cause substantial biases in an estimate of the total population 
size, $N$ (see for example Burnham et al. 1987 and Wolter 1986). A problem which occurs quite
often in applications of the procedure is the failure to accurately match persons from the sample
survey to the census. Seltzer and Adlakha (1974) demonstrated that matching error can result
in relative biases as large as 33% and may be positive or negative depending upon whether false 
nonmatches or false matches predominate (see also Scheuren and Oh 1985). Wolter (1983) notes 
that suspected matching errors in the 1980 Post Enumeration Program were a part of the reason 
not to adjust the 1980 U.S. Census. 

This paper provides a basic framework for evaluating the matching error in capture-recapture 
studies (particularly for applications to human populations) and for assessing the impact of 
the errors on the accuracy of the estimate of N. To provide a simple and familiar basis for the 
discussion of matching error, we shall adopt the original Sekar-Deming capture-recapture 
model. Extensions of the Sekar-Deming technique are given in Marks, Seltzer and Krotki (1974), 
and Wolter (1986).

Consider a population $U$ and let $N$ denote the size of $U$. A census is conducted and $N_c$
persons are counted. We wish to estimate $N - N_c$ (referred to as the coverage error of the census),
which is equivalent to estimating $N$. A post enumeration survey (PES) is conducted which
employs the same reference period as the census. We assume that: (a) both the census and the
PES contain no spurious events (i.e., duplications, fabrications, out-of-scope persons or unidentifiable
persons) or that the number of such events can be accurately estimated and subtracted
from $N_c$; and (b) the event of being counted in the census is independent of the event of being
counted in the PES.

The PES persons are matched to the census in order to determine the number of PES persons
who were also counted in the census. Let $x_{11}$ denote the design unbiased estimator of the total
number of persons in both the PES and the census populations and let $\hat{N}_p$ denote the design
unbiased estimator of the PES population size. The Sekar-Deming estimator (more recently
referred to as the dual system estimator or DSE) of $N$ is

$$\hat{N} = \frac{N_c \hat{N}_p}{x_{11}}. \qquad (1)$$

As we shall see, $\hat{N}$ is subject to two sources of error: sampling error and nonsampling error.
Although there may be several sources of nonsampling error, the source of the error of concern
here is matching error; i.e., the misclassification of PES persons as enumerated in the
census (false positive errors) or not enumerated in the census (false negative errors).

Using Taylor series expansions, general forms for the moments of $\hat{N}$ can be derived. It can
be shown that, to terms of order $1/n$, where $n$ is the PES sample size,

$$\mathrm{Bias}(\hat{N}) = -N\,[\mathrm{Relbias}(\hat{p}_{11}) - \mathrm{Relvar}(\hat{p}_{11})]\,[1 + \mathrm{Relbias}(\hat{p}_{11})]^{-1} \qquad (2)$$

and

$$\mathrm{Var}(\hat{N}) = N^{2}\,\mathrm{Relvar}(\hat{p}_{11})\,[1 + \mathrm{Relbias}(\hat{p}_{11})]^{-2} \qquad (3)$$

where $\hat{p}_{11} = x_{11}/\hat{N}_p$ is an estimator of $p_{11}$, the true proportion of the PES population falling
in the census population; $\mathrm{Relbias}(\hat{p}_{11}) = \mathrm{Bias}(\hat{p}_{11})/p_{11}$; and $\mathrm{Relvar}(\hat{p}_{11}) = \mathrm{Var}(\hat{p}_{11}) \times
E^{-2}(\hat{p}_{11})$. Here we have assumed that $N_c$, the census count, has a variance of zero. This is
a simplification since, as we mentioned, an estimate of the census spurious events may have
been subtracted from the census count to obtain $N_c$ and this correction may be subject to
sampling and other errors. Nevertheless, the assumption is consistent with our emphasis in
this paper on matching error and its effect on $\hat{N}$. The last section discusses an extension of
the methodology which allows error in the estimator $N_c$.
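
As a rough numerical illustration of (1)-(3), the following Python sketch evaluates the dual system estimate and the first-order bias and variance approximations. The function and argument names are hypothetical, chosen only for this example, and the inputs in the usage lines are invented values rather than results from any census.

```python
def dse(n_c, n_p_hat, x11):
    """Dual system estimate of N from eq. (1): N_c * N_p_hat / x11."""
    return n_c * n_p_hat / x11


def dse_bias(n_true, relbias_p, relvar_p):
    """First-order bias of the DSE from eq. (2)."""
    return -n_true * (relbias_p - relvar_p) / (1.0 + relbias_p)


def dse_var(n_true, relbias_p, relvar_p):
    """First-order variance of the DSE from eq. (3)."""
    return n_true ** 2 * relvar_p / (1.0 + relbias_p) ** 2


if __name__ == "__main__":
    # Invented inputs: 900 census persons, an estimated PES population of
    # 1000, of whom an estimated 850 were also counted in the census.
    print(dse(900, 1000, 850))
    # Relative bias of the DSE (n_true = 1) when p11_hat has a -0.4%
    # relative bias and a 1% coefficient of variation (relvar = 0.01**2).
    print(dse_bias(1.0, -0.004, 0.01 ** 2))
```

Passing `n_true = 1` to `dse_bias` returns the bias as a fraction of $N$, which is convenient when only relative quantities are known.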

From (2) and (3) we note that the total mean square error (MSE) of $\hat{N}$ depends upon the
total MSE of $\hat{p}_{11}$. In the following section, we consider some models for evaluating the effects
of matching error on $\hat{p}_{11}$. Letting $j$ ($j = 1, \ldots, n$) be the index for the $j$th individual in the PES
sample, we define $\alpha_j$ as the probability that individual $j$ is misclassified in the matching process
and consider alternative assumptions regarding the probabilities $\alpha_j$.


2. MATCHING ERROR MODELS


2.1 Uncorrelated Matching Error 


Assume: 


1. The event {unit $j$ is misclassified} is independent of the event {unit $j'$ is misclassified}
for all $j \neq j'$.

2. $\alpha_j = \theta$ if unit $j$ is truly in the census, referred to as the probability of a false negative
error, and $\alpha_j = \phi$ if unit $j$ is truly not in the census, referred to as the probability of a
false positive error.

To fix the ideas, we assume simple random sampling for the PES and that $n$ is small relative
to $N$. Then

$$E(\hat{p}_{11}) = p_{11}(1-\theta) + (1-p_{11})\phi, \qquad (4)$$

$$\mathrm{Bias}(\hat{p}_{11}) = -p_{11}\theta + (1-p_{11})\phi, \qquad (5)$$

$$\mathrm{Var}(\hat{p}_{11}) = n^{-1} E(\hat{p}_{11})\,(1 - E(\hat{p}_{11})) = n^{-1}(SV + SMV), \qquad (6)$$

where SV, denoting sampling variance, is given by

$$SV = p_{11}(1-p_{11})(1-\theta-\phi)^{2} \qquad (7)$$

and where SMV, denoting simple matching variance, is given by

$$SMV = p_{11}\theta(1-\theta) + (1-p_{11})\phi(1-\phi) \qquad (8)$$

(proof in the appendix).

Readers familiar with the Hansen, Hurwitz, and Pritzker (1964) response error model will
recognize the correspondence of their simple response variance and SMV in this model. Hansen
et al. define a measure $I$, referred to as the ‘‘index of [response] inconsistency,’’ to be the
ratio of the simple response variance to the total variance of a single response, i.e., the proportion
of variance which is response variance. For survey responses, $I$ is an indicator of the
response reliability of the survey information. An analogous measure can be obtained for matching
error to indicate the effect on the variance of $\hat{p}_{11}$ of matching unreliability. This
measure, denoted by $I_M$, is given by

$$I_M = \frac{SMV}{SV + SMV}. \qquad (9)$$
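
The quantities in (4)-(9) are simple to evaluate numerically. The sketch below is only an illustration under the uncorrelated error model; the function name, the dictionary keys, and the example inputs are assumptions made for this example and do not come from the paper.

```python
def uncorrelated_model(p11, theta, phi, n):
    """Moments of p11_hat under the uncorrelated matching error model.

    p11   : true proportion of the PES population in the census
    theta : probability of a false negative (false nonmatch)
    phi   : probability of a false positive (false match)
    n     : PES sample size (simple random sampling assumed)
    """
    expected = p11 * (1 - theta) + (1 - p11) * phi                  # eq. (4)
    bias = -p11 * theta + (1 - p11) * phi                           # eq. (5)
    sv = p11 * (1 - p11) * (1 - theta - phi) ** 2                   # eq. (7)
    smv = p11 * theta * (1 - theta) + (1 - p11) * phi * (1 - phi)   # eq. (8)
    return {
        "expected": expected,
        "bias": bias,
        "sv": sv,
        "smv": smv,
        "var": (sv + smv) / n,           # eq. (6)
        # eq. (9); undefined in the degenerate case sv + smv == 0
        "index": smv / (sv + smv),
    }


if __name__ == "__main__":
    # Hypothetical stratum: p11 = .85, 5% error of each type, n = 1000.
    print(uncorrelated_model(0.85, 0.05, 0.05, 1000))
```

A useful internal check is that `sv + smv` equals `expected * (1 - expected)`, which is the identity behind the two forms of (6).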


For some applications, assumptions (1) and (2) may be too restrictive. The independence
assumption (1) is violated, for example, when unit B in the PES is erroneously matched to unit
A in the census causing the correct match, unit A in the PES, to be erroneously classified as
a nonmatch. Since this implies that the errors for units A and B are negatively correlated, the
consequence is that $\mathrm{Var}(\hat{p}_{11})$ will be smaller than given by (6). However, $E(\hat{p}_{11})$ is not
affected by correlated errors. Another form of correlated matching error arises when matching
is performed by clerks who may vary in their tendencies to commit false positive and false
negative errors. The next section provides a model that describes these errors.

Assumption 2 specifies that the misclassification probabilities $\alpha_j$ are homogeneous across
the PES population. This too may be a simplification since some individuals, perhaps the 
majority, may be classified with relatively little risk of error while other individuals are more 
difficult to match. Basically, matching problems arise from inaccurate or incomplete infor- 
mation about the characteristics of each individual in either or both systems. Therefore, if the 
PES sample can be post-stratified on the basis of the completeness of the information to be 
used for matching, the assumption may hold (at least approximately) within each stratum. The 
overall matching error rate is thus an aggregation of the individual stratum error rates. The 
last subsection explores this model. 

Finally, the assumption of simple random sampling greatly reduces the complexity of the
formula for $\mathrm{Var}(\hat{p}_{11})$. Since PES samples are complex samples, the assumption is a simplification,
yet it still provides useful formulas for: (a) identifying which components of matching
error are likely to have the greatest impact on the total MSE of $\hat{N}$; and (b) allocating resources
for and designing matching error evaluation studies. In many situations, an adjustment of SV
by a ‘‘design effect’’ constant will account for most of the effect of complex sampling on
$\mathrm{Var}(\hat{p}_{11})$. Further, $E(\hat{p}_{11})$ is essentially unaffected by more complex forms of sampling than
simple random sampling as long as $\hat{p}_{11}$ is appropriately weighted. Thus, the form of $E(\hat{p}_{11})$
does not depend upon this assumption.


2.2 Modeling Clerical Error 


Suppose the PES is matched clerically to the census using $k$ clerks. Let $m_i$ denote the
number of PES individuals classified by clerk $i$, $i = 1, \ldots, k$. Let the double index $(i,j)$ denote
the $j$th individual in the $i$th clerk’s assignment.


Assume:


1. The event {unit $(i,j)$ is misclassified} and the event {unit $(i',j')$ is misclassified} are
independent when $i \neq i'$ and conditionally independent given clerk $i$ for $i = i'$, $j \neq j'$;
$i = 1, \ldots, k$; $j = 1, \ldots, m_i$.


2. $\alpha_{ij} = \theta_i$ if individual $(i,j)$ is truly in the census, and $= \phi_i$ if individual $(i,j)$ is truly not
in the census.


3. $E(\theta_i) = \theta$; $E(\phi_i) = \phi$; $\mathrm{Var}(\theta_i) = \sigma_\theta^2$; $\mathrm{Var}(\phi_i) = \sigma_\phi^2$; and $\mathrm{Cov}(\theta_i, \phi_i) = \sigma_{\theta\phi}$.


For the subset of individuals in the $i$th clerk’s assignment, 1 and 2 are analogous to assumptions
1 and 2 for the model of the last section. Assumption 3 specifies that clerk matching error
probabilities are independent and identically distributed random variables. This assumption
is analogous to the assumptions made for interviewer errors in interviewer effect models (see
for example Kish 1962, Hartley and Rao 1978 and Biemer and Stokes 1985). The assumption
is appropriate if our interest lies in estimating the parameters of a much larger pool of clerks
of which the $k$ PES clerks are a representative sample.

It is shown in the appendix that, assuming simple random sampling, $E(\hat{p}_{11})$ is still given
by (4). The general formula for $\mathrm{Var}(\hat{p}_{11})$ is given by (A.3) in the appendix; however, a useful
simplification results if we can assume that the assignment sizes $m_i$ are approximately equal
to $m$, the average size, and that each clerk’s assignment has the same expected number of
matches (i.e., clerk assignments are interpenetrated). Then


$$\mathrm{Var}(\hat{p}_{11}) = \frac{1}{n}(SV + SMV) + \frac{m-1}{m}\,\frac{1}{k}\,CC \qquad (11)$$


where CC, denoting the correlated component of matching variance, is

$$CC = p_{11}^{2}\sigma_\theta^{2} + (1-p_{11})^{2}\sigma_\phi^{2} - 2p_{11}(1-p_{11})\sigma_{\theta\phi} \qquad (12)$$


and SV, SMV are given by (7) and (8), respectively.

Note that CC is a consequence of the between clerk variability of the misclassification probabilities
$\theta_i$ and $\phi_i$. Further, by noting that CC is the variance of $-p_{11}\theta_i + (1-p_{11})\phi_i$ and
the similarity of these terms with (5), we see that CC is the variance of the net biases among
clerks. This latter fact proves that CC must be nonnegative. Therefore, the effect of clerk variance
is to increase the variance of $\hat{p}_{11}$.

Borrowing again from the response variance literature, we can define a parameter $\rho_M$ which
is analogous to the intra-interviewer correlation coefficient, $\rho$, defined by Kish (1962). We shall
refer to $\rho_M$ as the intra-clerk correlation since it is the correlation between the match classifications
of any two units in the same clerk assignment. Under the model,

$$\rho_M = \frac{CC}{SV + SMV}$$

is the ratio of the correlated component of variance to the total variance associated with a single
classification. It may be interpreted as the degree to which clerks ‘‘influence’’ the match rates within
their assignments. Now, an alternative formula for $\mathrm{Var}(\hat{p}_{11})$ which is equivalent to (11) is

$$\mathrm{Var}(\hat{p}_{11}) = \frac{SV + SMV}{n}\,[1 + (m-1)\rho_M]. \qquad (13)$$

2.3 Post-stratification 


Both the model for uncorrelated error and the model for clerical error assume (essentially)
that individuals in the PES sample do not differ in the degree of difficulty of determining their
true match classification (assumption 2 for both models). For example, for the clerical error
model, the misclassification probability vector $(\theta_i, \phi_i)$ is the same for all units in the $i$th clerk’s
assignment. In reality, however, some individuals are much more difficult to classify than others
depending upon such factors as the completeness of the matching information, whether the person
is a mover or non-mover, whether the person lives in a single family home or an apartment, etc.

A simple approach for modeling this situation is to stratify the PES sample according to some
variable, say $Z$, which is correlated with the misclassification probabilities $\alpha_j$. The variable $Z$
may be an indicator of the completeness of the information, the type of unit, etc.

Suppose there are $L$ such strata indexed by $h$. Let $(i,h,j)$ denote the $j$th unit in the $h$th
stratum in the $i$th clerk’s assignment where $i = 1, \ldots, k$; $h = 1, \ldots, L$; $j = 1, \ldots, m_{ih}$; and $m_{ih}$ is
the number of units in stratum $h$ for the $i$th clerk. We shall again assume (1) as for the clerical
error model; however, in addition assume:


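2. $\alpha_{ihj} = \theta_{ih}$ if individual $(i,h,j)$ is truly in the census,

$= \phi_{ih}$ if individual $(i,h,j)$ is truly not in the census.


3. $E(\theta_{ih}) = \theta_h$; $E(\phi_{ih}) = \phi_h$;
$\mathrm{Var}(\theta_{ih}) = \sigma_{\theta h}^{2}$; $\mathrm{Var}(\phi_{ih}) = \sigma_{\phi h}^{2}$;
$\mathrm{Cov}(\theta_{ih'}, \phi_{ih}) = \sigma_{\theta\phi h}$ if $h = h'$,

$= 0$ if $h \neq h'$.

Under these assumptions, we have $\mathrm{Bias}(\hat{p}_{11}) = \sum_h \pi_h\,\mathrm{Bias}(\hat{p}_{11h})$ and $\mathrm{Var}(\hat{p}_{11}) = \sum_h \pi_h^{2}\,\mathrm{Var}(\hat{p}_{11h})
+ n^{-1}\sum_h \pi_h\,[E(\hat{p}_{11h}) - E(\hat{p}_{11})]^{2}$, where $\mathrm{Bias}(\hat{p}_{11h})$, $E(\hat{p}_{11h})$, and $\mathrm{Var}(\hat{p}_{11h})$ are given
by (5), (4), and (6), respectively, indexing the clerk error parameters and $p_{11}$ by $h$ and where
$\pi_h = E(n_h/n)$, the proportion of the population in the $h$th stratum.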


3. DEMONSTRATION OF THE EFFECT ON TOTAL ERROR 


The models of the previous section can be useful for demonstrating the effect of matching
error on the total mean square error of $\hat{N}$ and $\hat{p}_{11}$. In the illustrations that follow, we shall
assume values of the model parameters which are typical given our experience and which are
consistent with current 1990 PES design parameters.

In the PES, estimates of $N$ will be made for a number of census strata. We assume that the
desired coefficient of variation of the estimates is 1%. Matching will be conducted in a number
of processing sites by teams of clerks. (More details on the matching operation are given in
the next section.) To illustrate the effect of matching error on the DSE, we consider a ‘‘typical’’
PES stratum. For this stratum, let $p_{11} = .85$ and $k$, the number of matching clerks in one processing
site, be 10. In our analysis, we considered values of $\theta$ which varied from 0 to .10 and
a number of typical values for the ratio $\gamma = \theta/\phi$, i.e., the ratio of the probability of false
negatives to the probability of false positives. Little information exists which would indicate
the typical range of $\rho_M$ since no study has ever measured $\rho_M$ for matching error. However, if
we assume that the clerk error probabilities $\theta_i$ and $\phi_i$ follow a unimodal beta-distribution and
are uncorrelated, we can obtain a maximum value for $\rho_M$ corresponding to given values of the
expected error probabilities $\theta$ and $\phi$. Algebraically, the maximum value of $\rho_M$ is given by

$$\rho_M^{*} = CC^{*}/(SMV + SV) \qquad (14)$$

where $CC^{*} = p_{11}^{2}\,\theta^{2}(1-\theta)/(1+\theta) + (1-p_{11})^{2}\,\phi^{2}(1-\phi)/(1+\phi)$ (see Johnson and
Kotz 1970, for the underlying theory). If $\theta_i$ and $\phi_i$ are positively correlated, then the assumption
of zero correlation further exaggerates the effect of CC. Thus, the illustrations which follow
indicate the maximum impact of matching variance on the estimates.
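
The bound in (14) can be evaluated directly. The following Python sketch, with hypothetical function names, computes $CC^{*}$, the corresponding maximum intra-clerk correlation, and the share of $CC^{*}$ that comes from the false positive term; it assumes uncorrelated clerk error probabilities, as in the text.

```python
def max_beta_variance(mean):
    """Variance term mean^2 (1 - mean) / (1 + mean) that enters CC*."""
    return mean ** 2 * (1.0 - mean) / (1.0 + mean)


def cc_star(p11, theta, phi):
    """CC* as defined below eq. (14)."""
    return (p11 ** 2 * max_beta_variance(theta)
            + (1.0 - p11) ** 2 * max_beta_variance(phi))


def rho_m_star(p11, theta, phi):
    """Maximum intra-clerk correlation, eq. (14)."""
    sv = p11 * (1.0 - p11) * (1.0 - theta - phi) ** 2                    # eq. (7)
    smv = p11 * theta * (1.0 - theta) + (1.0 - p11) * phi * (1.0 - phi)  # eq. (8)
    return cc_star(p11, theta, phi) / (sv + smv)


def phi_share_of_cc_star(p11, theta, phi):
    """Fraction of CC* contributed by the false positive term.

    Undefined when theta = phi = 0, because CC* is then zero.
    """
    return ((1.0 - p11) ** 2 * max_beta_variance(phi)
            / cc_star(p11, theta, phi))


if __name__ == "__main__":
    # Stratum used in the text: p11 = .85, with theta = phi = .05.
    print(rho_m_star(0.85, 0.05, 0.05))
    print(phi_share_of_cc_star(0.85, 0.05, 0.05))
```

When $\theta = \phi$ the share reduces to $(1-p_{11})^2/[p_{11}^2 + (1-p_{11})^2]$, which depends only on $p_{11}$.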

To illustrate the maximum effect of correlated variance on the precision of $\hat{p}_{11}$, the coefficient
of variation of $\hat{p}_{11}$, denoted by $CV(\hat{p}_{11})$, was graphed as a function of $\theta$ for various
values of $\gamma$. For these calculations, $\rho_M^{*}$ was substituted for $\rho_M$ in (13). The range of $\theta$ was
$0 \le \theta \le .10$ and $\gamma$ was $.5 \le \gamma \le 5$; i.e., $\phi = .2\theta$ to $\phi = 2\theta$. This range of values of $\gamma$ seems
reasonable since, typically, $\phi$ is smaller than $\theta$. Figure 1 shows the function for $\gamma = 1$. There
was no discernible difference for other values of $\gamma$ in the range of interest. Thus, it appears
that the size of $\phi$ has negligible effect on $CV(\hat{p}_{11})$. In fact, we see from the expression for
CC that when $p_{11} = .85$, no more than 3% of the correlated variance is contributed by the
variance of $\phi_i$ even when $\phi$ is the same size as $\theta$. Figure 1 also suggests that $CV(\hat{p}_{11})$ may be
increased two-fold to 2% for values of $\theta$ as small as 5%.

In Figure 2, the relative bias of $\hat{p}_{11}$, denoted by $RB(\hat{p}_{11})$, is illustrated for the same range of
$\theta$, i.e., $0 \le \theta \le .1$, and $\gamma$, i.e., $.5 \le \gamma \le 5$. The graph clearly indicates that bias is smaller for
smaller values of $\gamma$. In fact, the bias is zero when $\gamma = (1-p_{11})/p_{11}$ or .18 assuming $p_{11} = .85$
as in this example. For $\theta$ as small as 5%, the relative bias is between $-2\%$ and $-4\%$, depending
upon the size of $\gamma$. Comparing this with the maximum increase in $CV(\hat{p}_{11})$ of one percentage
point, we see that bias has the potential to be much more serious than correlated variance.

Figure 1. Coefficient of Variation of $\hat{p}_{11}$ as a Function of $\theta$ for $\gamma = 1$

Figure 2. Relative Bias of $\hat{p}_{11}$ as a Function of $\theta$ for Selected Values of $\gamma$

To indicate the potential effects of matching error on $\hat{N}$, the increase in total error as a function
of $\theta$ and for selected values of $\gamma$ was computed. Let $M(\theta;\gamma)$, $V(\theta;\gamma)$, and $B(\theta;\gamma)$ denote
the mean square error, variance, and bias, respectively, of $\hat{N}$ for given values of $\theta$ and $\gamma$. $M(0;\gamma)$
is the mean square error of $\hat{N}$ without matching error (i.e., $\theta = \phi = 0$) and thus $M(0;\gamma)^{1/2}$ is
approximately the standard error of $\hat{N}$. Define $RM(\theta;\gamma) = (M(\theta;\gamma)/M(0;\gamma) - 1)^{1/2}$;
$RV(\theta;\gamma) = (V(\theta;\gamma)/M(0;\gamma) - 1)^{1/2}$; and $RB(\theta;\gamma) = (B^{2}(\theta;\gamma)/M(0;\gamma))^{1/2}$.

Thus, $RM(\theta;\gamma)$ is the square root of the increase in the total mean square error of $\hat{N}$ for
given $\theta$ and $\gamma$ relative to the root MSE of $\hat{N}$ with no matching error. $RV(\theta;\gamma)$ is the contribution
to this increase due to matching variance while $RB(\theta;\gamma)$ is the contribution due to
matching bias. Hence, we have $RM(\theta;\gamma)^{2} = RV(\theta;\gamma)^{2} + RB(\theta;\gamma)^{2}$. Figures 3 and 4 show
these functions for two extreme values of $\gamma$, $\gamma = .5$ and 5, respectively, and for $0 \le \theta \le .1$.
Again, the maximum value of the correlated variance, $CC^{*}$, was used for the variance computations.
Thus, the contribution of matching variance to total error is probably substantially
exaggerated.
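
The relative bias of $\hat{p}_{11}$ as a function of $\theta$ and $\gamma$, and the decomposition $RM^{2} = RV^{2} + RB^{2}$, can be sketched as below. This is an illustration only: the function names are hypothetical, $\phi$ is recovered as $\theta/\gamma$, and the caller is assumed to supply the variance, bias, and no-error MSE of $\hat{N}$, for example from (2) and (3).

```python
import math


def relbias_p11(p11, theta, gamma):
    """Relative bias of p11_hat from eq. (5), with phi = theta / gamma."""
    phi = theta / gamma
    return (-p11 * theta + (1.0 - p11) * phi) / p11


def error_increase(mse_no_error, variance, bias):
    """Return (RM, RV, RB) for given moments of N_hat.

    mse_no_error : M(0; gamma), the MSE of N_hat with no matching error
    variance     : V(theta; gamma); assumed to be at least mse_no_error,
                   otherwise math.sqrt raises ValueError
    bias         : B(theta; gamma)
    """
    rv = math.sqrt(variance / mse_no_error - 1.0)
    rb = math.sqrt(bias ** 2 / mse_no_error)
    rm = math.sqrt((variance + bias ** 2) / mse_no_error - 1.0)
    return rm, rv, rb


if __name__ == "__main__":
    p11 = 0.85
    # The relative bias vanishes at gamma = (1 - p11) / p11.
    print(relbias_p11(p11, 0.05, (1.0 - p11) / p11))
    # Relative bias at the two extreme values of gamma considered in the text.
    print(relbias_p11(p11, 0.05, 0.5), relbias_p11(p11, 0.05, 5.0))
```

Because the mean square error is the variance plus the squared bias, the three returned quantities always satisfy the Pythagorean relation stated above.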

These figures indicate that for these values of $\theta$ and $\gamma$, most of the error is contributed by
bias, although the contribution to variance can be non-trivial. Further, as suggested earlier 
for Figures 1 and 2, the matching bias dominates the total matching error whenever false 
negative error dominates over false positive error. 


4. ESTIMATION FROM REMATCH STUDIES 


Methods for estimating the components of response error in sample surveys have been well 
documented in the literature (see for example Hansen, Hurwitz and Pritzker 1964, Hansen, 
Hurwitz and Bershad 1961). The techniques for estimating the components of matching error 
are essentially the same. For example, to estimate the correlated component of matching 
variance, CC, the assignments of the clerks must be ‘‘interpenetrated.’’ This procedure, which 
is described in detail in Kish (1962), randomizes the assignment of PES cases to clerks so that 
each clerk’s assignment has the same expected number of matched persons. Then, an estimator 
of CC is formed by the difference between the between clerks and within clerks mean squares 
from the analysis of variance of clerks. For more details of this procedure, refer to Bureau 
of the Census (1985). 

In this section, the focus is on the analysis of data from rematch studies, the most commonly
used method for evaluating matching error. There are two types of rematch studies. One
attempts to replicate the original match operation for a sample of cases using the same procedures,
training, match rules, etc. This type of rematch has the objective of estimating SMV,
the simple matching variance or, equivalently, $I_M$, the index of match inconsistency. The
second type of rematch aims at obtaining the most correct match possible and, therefore, uses
more extensive procedures, highly qualified and expert clerks, and adjudication, i.e., resolving
disagreements among the original and rematch classifications by a third, expert matcher. This
type of rematch has the objective of estimating the matching bias. Further, as we will see, an
estimate of SMV is also possible from these data.

The (unweighted) data collected in a rematch study can be displayed as in Table 1. Assume
that the rematch sample is a simple random sample of $r$ persons from the PES. Further we
may assume either the uncorrelated error model or the clerical error model of the last section
for both the match and rematch. Let $\mu_t$ ($t = a, b, c, d$) denote the mean observed proportion
of the cell corresponding to $t$ in Table 1.

Figure 3. $RM(\theta;\gamma)$, $RV(\theta;\gamma)$, and $RB(\theta;\gamma)$ as a Function of $\theta$ for $\gamma = .5$

Figure 4. $RM(\theta;\gamma)$, $RV(\theta;\gamma)$, and $RB(\theta;\gamma)$ as a Function of $\theta$ for $\gamma = 5$

Table 1
Rematch Study Data


                        Rematch Classification

Original
Classification          Matched        Not Matched

Matched                 a              b
Not Matched             c              d

Then

$$\mu_a = p_{11}(1-\theta_A)(1-\theta_B) + (1-p_{11})\phi_A\phi_B \qquad (15)$$

$$\mu_b = p_{11}(1-\theta_A)\theta_B + (1-p_{11})\phi_A(1-\phi_B) \qquad (16)$$

$$\mu_c = p_{11}\theta_A(1-\theta_B) + (1-p_{11})(1-\phi_A)\phi_B \qquad (17)$$

$$\mu_d = p_{11}\theta_A\theta_B + (1-p_{11})(1-\phi_A)(1-\phi_B) \qquad (18)$$


where the index A denotes original match and B denotes the rematch.
Define


$$\mu_A = E\left(\frac{a+b}{r}\right) = p_{11}(1-\theta_A) + (1-p_{11})\phi_A \qquad (19)$$

and

$$\mu_B = E\left(\frac{a+c}{r}\right) = p_{11}(1-\theta_B) + (1-p_{11})\phi_B. \qquad (20)$$


Note that $\mu_A$ and $\mu_B$ are expected values of the estimates of $p_{11}$ based upon the original and
the rematch classifications, respectively. The difference of these two estimates of $p_{11}$, i.e.,
$(b-c)/r$, is referred to as the net difference rate (NDR). Its expected value is

$$E(NDR) = \mu_A - \mu_B = -p_{11}(\theta_A - \theta_B) + (1-p_{11})(\phi_A - \phi_B). \qquad (21)$$


Finally, the proportion of the $r$ sample individuals having rematch classifications which
disagree with the original match classification is $(b+c)/r$, referred to as the gross difference
rate (GDR). Its expected value is

$$E(GDR) = \mu_b + \mu_c
= p_{11}[\theta_A(1-\theta_B) + (1-\theta_A)\theta_B] + (1-p_{11})[(1-\phi_A)\phi_B + \phi_A(1-\phi_B)]. \qquad (22)$$
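
The cell expectations (15)-(18) and the expected net and gross difference rates (21)-(22) follow mechanically from the five parameters. The Python sketch below is a hypothetical helper written for illustration; its names and the example parameter values are not taken from the paper.

```python
def rematch_cell_means(p11, theta_a, phi_a, theta_b, phi_b):
    """Expected cell proportions (mu_a, mu_b, mu_c, mu_d) of Table 1.

    Subscript a refers to the original match, b to the rematch; the two
    classifications are assumed independent given the true status.
    """
    q = 1.0 - p11
    mu_a = p11 * (1 - theta_a) * (1 - theta_b) + q * phi_a * phi_b          # (15)
    mu_b = p11 * (1 - theta_a) * theta_b + q * phi_a * (1 - phi_b)          # (16)
    mu_c = p11 * theta_a * (1 - theta_b) + q * (1 - phi_a) * phi_b          # (17)
    mu_d = p11 * theta_a * theta_b + q * (1 - phi_a) * (1 - phi_b)          # (18)
    return mu_a, mu_b, mu_c, mu_d


def expected_ndr(p11, theta_a, phi_a, theta_b, phi_b):
    """E(NDR) = mu_b - mu_c, which equals mu_A - mu_B in eq. (21)."""
    _, mu_b, mu_c, _ = rematch_cell_means(p11, theta_a, phi_a, theta_b, phi_b)
    return mu_b - mu_c


def expected_gdr(p11, theta_a, phi_a, theta_b, phi_b):
    """E(GDR) = mu_b + mu_c, eq. (22)."""
    _, mu_b, mu_c, _ = rematch_cell_means(p11, theta_a, phi_a, theta_b, phi_b)
    return mu_b + mu_c


if __name__ == "__main__":
    # Invented error rates: the rematch is more accurate than the original.
    print(rematch_cell_means(0.85, 0.05, 0.02, 0.01, 0.005))
    print(expected_ndr(0.85, 0.05, 0.02, 0.01, 0.005))
    print(expected_gdr(0.85, 0.05, 0.02, 0.01, 0.005))
```

Setting the two sets of error rates equal gives an expected NDR of zero, while setting the rematch rates to zero makes the expected NDR equal to the bias in (5).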


We shall now consider the estimation of the components of $\mathrm{Var}(\hat{p}_{11})$ and $\mathrm{Bias}(\hat{p}_{11})$ under
three sets of assumptions for the rematch study. In the first case, we assume that the rematch
study is conducted under the same general conditions as the original match so that the error
parameters associated with both classifications are very nearly the same. For example, the clerks
for both operations received the same training, have the same skill level, and use the same
procedures. The second case assumes that the rematch is perfect, i.e., the rematch classification
may be considered the true classification. The third case falls somewhere between cases 1 and
2. More extensive and improved matching procedures are used in the rematch; however, we
are not willing to assume that the rematch classifications are without error. Instead we assume
that fewer errors are made in the rematch than in the original match.


Case 1. Same General Conditions for the Match and Rematch 


Assume that $\theta_A = \theta_B = \theta$ and $\phi_A = \phi_B = \phi$, i.e., the expected rates of misclassification are
the same for both trials. Then, from (21), $E(NDR) = 0$ and no estimate of $\mathrm{Bias}(\hat{p}_{11})$ can be
computed from the data. However, from (22) and (8),

$$\tfrac{1}{2}\,E(GDR) = SMV. \qquad (23)$$

Further, an estimator of $I_M$ in (9) is

$$\hat{I}_M = GDR/(2\hat{p}_{11}(1-\hat{p}_{11})) \qquad (24)$$

where $\hat{p}_{11}$ is the PES estimator of $p_{11}$ as defined for (2). Alternatively, an estimator of $E(\hat{p}_{11})$
can be obtained from Table 1; for example, see the estimators in (19) and (20).
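
For Case 1, the estimator in (24) needs only the four cell counts of Table 1. The following Python sketch is a minimal illustration with hypothetical names and invented counts; by default it takes the estimate of $E(\hat{p}_{11})$ from the original match margin, $(a+b)/r$, as in (19).

```python
def index_of_match_inconsistency(a, b, c, d, p11_hat=None):
    """Case 1 estimator of I_M from eq. (24).

    a, b, c, d : cell counts laid out as in Table 1
    p11_hat    : optional external estimate (for example from the PES);
                 defaults to the original-match proportion (a + b) / r
    """
    r = a + b + c + d
    gdr = (b + c) / r                      # gross difference rate
    if p11_hat is None:
        p11_hat = (a + b) / r
    return gdr / (2.0 * p11_hat * (1.0 - p11_hat))


if __name__ == "__main__":
    # Invented counts for a replicated rematch of 100 cases.
    print(index_of_match_inconsistency(80, 5, 5, 10))
```

Half of the gross difference rate, `gdr / 2`, is the corresponding estimate of SMV from (23).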
Case 2. Perfect Rematch 


Assume that $\theta_B = \phi_B = 0$, i.e., the rematch is conducted without misclassification error.
Then, from (21),

$$E(NDR) = -p_{11}\theta_A + (1-p_{11})\phi_A = \mathrm{Bias}(\hat{p}_{11}). \qquad (25)$$

Further, the probability of false negative error, $\theta_A$, is estimated by

$$\hat{\theta} = c/(a+c), \qquad (26)$$

and the probability of false positive error, $\phi_A$, is estimated by

$$\hat{\phi} = b/(b+d). \qquad (27)$$

An estimator of SMV is

$$\widehat{SMV} = \frac{1}{r}\left(\frac{ac}{a+c} + \frac{bd}{b+d}\right) \qquad (28)$$

and, thus, an estimator of $I_M$ is

$$\hat{I}_M = \widehat{SMV}/[\hat{p}_{11}(1-\hat{p}_{11})] \qquad (29)$$

where $\hat{p}_{11}$ is an estimator of $E(\hat{p}_{11})$ obtained either from the PES or from Table 1.


Case 3. Rematch Has Smaller Error But is Not Perfect


Assume that $0 < \theta_B < \theta_A$ and $0 < \phi_B < \phi_A$; i.e., the misclassification probabilities for the
rematch are smaller than for the original match but are not zero. Then no unbiased estimator
of $\mathrm{Bias}(\hat{p}_{11})$ exists. However, $|E(NDR)|$ will be smaller than $|\mathrm{Bias}(\hat{p}_{11})|$ if $\mu_A - p_{11}$ and $\mu_B
- p_{11}$ both have the same sign; i.e., the estimators of $p_{11}$ based on the match and the rematch
data are biased in the same direction. Thus, under these conditions, $|NDR|$ is a lower bound
estimator of $|\mathrm{Bias}(\hat{p}_{11})|$.

Further, there is no unbiased estimator of SMV. However, it can be seen from (22) that

$$E(GDR) - 2\,SMV = p_{11}(\theta_B - \theta_A)(1 - 2\theta_A) + (1-p_{11})(\phi_B - \phi_A)(1 - 2\phi_A).$$

Thus, whenever $\theta_A$ and $\phi_A$ are both less than .5, which is true in most practical applications,
we have

$$E(GDR) < 2\,SMV$$

and $\hat{I}_M$ defined in (24) will underestimate $I_M$.
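
The Case 3 inequality can be checked numerically. The Python sketch below, using hypothetical function names and invented error rates, computes the expected gross difference rate from (22) and compares it with twice the simple matching variance of the original match from (8).

```python
def expected_gdr(p11, theta_a, phi_a, theta_b, phi_b):
    """E(GDR) from eq. (22)."""
    return (p11 * (theta_a * (1 - theta_b) + (1 - theta_a) * theta_b)
            + (1 - p11) * ((1 - phi_a) * phi_b + phi_a * (1 - phi_b)))


def simple_matching_variance(p11, theta, phi):
    """SMV of the original match, eq. (8)."""
    return p11 * theta * (1 - theta) + (1 - p11) * phi * (1 - phi)


def gdr_shortfall(p11, theta_a, phi_a, theta_b, phi_b):
    """E(GDR) - 2 SMV; negative when the rematch has the smaller error rates
    and the original error rates are below .5."""
    return (expected_gdr(p11, theta_a, phi_a, theta_b, phi_b)
            - 2.0 * simple_matching_variance(p11, theta_a, phi_a))


if __name__ == "__main__":
    # Invented rates: original match (.05, .02), improved rematch (.01, .005).
    print(gdr_shortfall(0.85, 0.05, 0.02, 0.01, 0.005))
```

The more accurate the rematch, the more negative the shortfall, so half the GDR understates SMV by a growing margin as the rematch approaches the perfect rematch of Case 2.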


5. APPLICATION TO THE 1990 CENSUS 


In the 1990 Census, the PES sample will consist of about 5000 ‘‘blocks’’ or groups of about 
30 contiguous housing units and attempts will be made to match each person in every block 
to the census. The variables used for matching will include Name, Address, Relation to Head 
of Household, Sex, Birthdate, Marital Status, Race, and Hispanic Origin. The matching process 
will involve four separate stages as follows: 


Stage 1. A computer match operation using the Fellegi and Sunter (1969) technique. Each PES 
person will be classified as either matched to the census, not matched, or possibly 
matched (i.e., requiring clerical review) by computer. 


Stage 2. A first clerical review to correct any mismatches or erroneous non-matches made by 
the computer. In addition, a standardized set of matching rules will be applied to each 
possible match. Thus, each PES person will be classified as either a match, a non- 
match, a possible match or an unresolved case. 


Stage 3. A second clerical review to reconsider, by applying greater human judgment, the 
classification made at the two earlier stages. The clerks for this stage, referred to as 
the special matching group (SMG), may also decide that for some households further 
field follow-up is required. 


Stage 4. An ‘‘after field follow-up’’ review. Cases are reconsidered on the basis of any addi- 
tional information obtained in the follow-up. The final classification codes are 
matched (enumerated), not matched (not enumerated) or unresolved (match status 
to be imputed in the final processing stage).

The procedures for imputing ‘‘matched’’ or ‘‘not matched’’ for unresolved cases are described
in Schenker (1987). These cases, which account for about 1% of the PES sample, are not included
in the tables which follow since the imputed match statuses of the unresolved cases were not
available for this test. Nevertheless, imputation error can be an important source of matching
error — one which poses special problems for the evaluation. For example, it is likely that some 
of the PES unresolved cases will also be unresolved in the rematch and no direct estimate of 
misclassification error can be computed for these cases. In the test described below, 83% of the 
unresolved PES cases remained unresolved in the rematch. Conversely, 41% of the cases which 
were unresolved in the rematch, were resolved in the PES match. If one assumes that imputa- 
tions for those cases which were unresolved in the rematch are erroneous, an upper bound on 
the imputation error can be obtained. Likewise, a lower bound can be obtained by assuming 
all these imputations are correct. However, unless the proportion of imputations is very small, 
this ‘‘worst-case, best-case’’ analysis may yield bounds which are too wide to be useful. 

In 1986, a pretest of these PES matching procedures was conducted in Los Angeles. A sample
of about 4000 persons was matched to the Los Angeles test census and then rematched by
census professional staff to evaluate matching bias. Special procedures were used in the rematch 
to ensure a very accurate match classification. Table 2 displays the rates of disagreement among 
the four stages of matching and the rematch. Note the improvement of the classifications at 
each higher stage indicated by the decreasing disagreement rate in the rematch column. The 
data also indicate that few classifications are affected in the ‘‘after follow-up’’ stage (.68% 
disagreement with stage 3). Further, the GDR for the final stage (relative to the rematch) is 
very low, less than 1%. 

Under the assumption that the rematch process yields the true match status, Table 3 gives
the estimates of $\theta$, the probability of false negative error, and $\phi$, the probability of false positive
error, for each stage of matching. It appears that for the computer match and the first level
clerical match, the false nonmatch rate predominates. However, the opposite is true for the
final two stages of matching.


Table 2
Comparison of Disagreement Rates for Stages of Matching (%)

             Stage 2    Stage 3    Stage 4    Rematch
Stage 1        2.9        4.4        4.7      [illegible]
Stage 2        0          3.3        4.0        4.8
Stage 3        3.3        0           .68       1.6
Stage 4        4.0         .68       0           .87

Table 3
Estimates of $\theta$ and $\phi$ for Stages of Matching

Stage of matching    Estimate of $\theta$ (x100%)    Estimate of $\phi$ (x100%)
                     (false nonmatch rate)          (false match rate)
1                          6.2                         [illegible]
2                          [illegible]                 3.3
3                          [illegible]                 [illegible]
4                          [illegible]                 [illegible]

Table 4
Results of the Rematch Study (weighted)

Original Match          Rematch Classification
Classification          Matched      Not Matched
Matched                  16690             9
Not Matched                 85          2178

Table 5a
Rematch Results For Cases With Agreement On All Four Stages

Original Match          Rematch Classification
Classification          Matched      Not Matched
Matched                  14458             0
Not Matched                 64          [illegible]

Table 5b
Rematch Results For Cases With Disagreement On at Least One Stage

Original Match          Rematch Classification
Classification          Matched      Not Matched
Matched                   2223             9
Not Matched                 21           403


Using the methodology of the previous section, we can estimate Relbias($\hat{p}_{1|1}$), Relbias($\hat{N}$),
and the index of match inconsistency. Table 4 gives the results of the rematch study,
weighted for the rematch sample probabilities of selection. For this table, the estimate of
Relbias($\hat{p}_{1|1}$) is -.4% and therefore the estimate of Relbias($\hat{N}$) is .4%, computed from (2)
assuming a 1% coefficient of variation for $\hat{p}_{1|1}$ and replacing Relbias($\hat{p}_{1|1}$) by its estimate.
The index of match inconsistency is estimated to be .49%, which is in the very low range. The
false positive rate is $\phi$ = .004 and the false negative rate is $\theta$ = .005.
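
As a check on the arithmetic, the two error rates quoted for Table 4 can be recomputed from the weighted cell counts, treating the rematch classification as the true status. The sketch below is illustrative only; the function name `match_error_rates` and its argument names are hypothetical and do not come from the paper.

```python
def match_error_rates(mm, mn, nm, nn):
    """Return (false_negative_rate, false_positive_rate).

    Arguments are the four cells of an original-by-rematch table
    (names are hypothetical, chosen for this sketch):
      mm: originally matched,     rematch matched
      mn: originally matched,     rematch not matched (false matches)
      nm: originally not matched, rematch matched     (false nonmatches)
      nn: originally not matched, rematch not matched
    The rematch classification is assumed to be the truth.
    """
    # Share of true matches that the original process called nonmatches.
    false_negative = nm / (mm + nm)
    # Share of true nonmatches that the original process called matches.
    false_positive = mn / (mn + nn)
    return false_negative, false_positive


# Weighted counts from Table 4.
theta_hat, phi_hat = match_error_rates(mm=16690, mn=9, nm=85, nn=2178)
print(f"false negative rate: {theta_hat:.3f}")
print(f"false positive rate: {phi_hat:.3f}")
```

Rounded to three decimals, these reproduce the .005 and .004 reported in the text.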

As mentioned in the second section, the probability of matching error may depend upon
the completeness of the PES or census information, among other things. To indicate the extent
to which match error rates vary, the rematch sample was partitioned into two subsamples. The
first subsample was composed of cases which were classified as "matched" or "not matched"
consistently across all stages of matching, i.e., for which all four stages agreed. The remainder
of the sample made up the second subsample, i.e., cases for which at least one of the stages
disagreed. This division approximates a division based upon completeness of the matching
information since most of the cases having no disagreement between stages are those where
information is the most complete. The weighted results are shown in Tables 5a (complete cases)
and 5b (incomplete cases).
For "complete" cases, the false negative rate is .44% while the false positive rate is 0. Thus,
none of the cases were erroneously matched although a modest number were erroneously called
nonmatches. These data may provide evidence of the greater skill of the rematch staff at finding
matches for PES cases. The estimate of the index of match inconsistency is .39%, very low. For
"incomplete" cases, the false negative rate is .93% while the false positive rate is 2.18%. The
estimate of the index is 1.1%, still quite low. However, these data indicate a much higher risk
of false matches for the "incomplete" cases.

The data from this study indicate that matching error causes a small negative bias (-.4%)
in $\hat{N}$ which amounts to an underestimate of approximately one million persons (assuming
N = 250 million persons). Even for the more difficult cases the bias is only -.7%. It would
be interesting to look at certain demographic subgroups of the population (movers, proxy
respondents, and apartment dwellers) to see the extent of matching error for these domains.
Unfortunately, the information that would allow this analysis is not currently available.


6. SUMMARY 


The models and MSE formulas developed in this paper can be useful for evaluating the
impact of matching error on estimates of census coverage error. In the context of the 1990 U.S.
census, matching error bias appears to be the largest and most important component of
MSE($\hat{N}$). Preliminary studies of the magnitude of matching error bias for the 1990 Census
indicate that this component is small, less than one half of one percent. This estimate does
not reflect imputation error, which affects about 1% of the PES cases. Moreover, estimates
of bias depend heavily on the assumption that the rematch process yields the true match
classification. More work is needed to check the validity of this assumption.

In the development of the formulas for the total mean square error of $\hat{N}$, we assumed that
the census count was not prone to error. However, in actual practice, an estimate of the number
of census spurious events (or erroneous enumerations), denoted by EE, may be subtracted from the
census count. Since this estimator is obtained from a match of a sample of the census units to
the PES, EE is also subject to sampling error and matching error. For example, a person may be
classified as an erroneous enumeration when they were correctly enumerated (false positive
error), or they may be classified as correctly enumerated when they were erroneously enumerated
(false negative error). The model and methodology formulated for evaluating the effect of false
positive and false negative errors on the PES match classifications can be easily extended to
the estimator of erroneous enumerations. Note that the Taylor approximation formulas for the
bias and variance of $\hat{N}$, (2) and (3), will now contain terms for the bias and variance of EE.

For future research, studies of matching error correlated variance are needed to inform us
of the extent to which the clerk variance contributes to the total error of $\hat{N}$. We suspect that
CC, the maximum effect of correlated error, substantially overestimates the impact of clerks.
Research is also needed from rematch studies to identify the characteristics of persons or
households prone to matching error. Perhaps then special efforts could be directed toward
these cases. For this objective, the use of logistic models should be explored for predicting the
probability that a case is misclassified from the various characteristics of the case.
APPENDIX


Derivation of the MSE Formulas 


Let U denote the population of size N to be enumerated. Let $U_1$ denote the subset of U
which is enumerated in the census. Let S denote the PES sample and $S_1$ denote $S \cap U_1$, the set
of PES persons enumerated in the census. Denote the n units in S as $u_1, \ldots, u_n$. Define the
variables

$$\eta_i = \begin{cases} 1 & \text{if } u_i \in S_1, \\ 0 & \text{if } u_i \notin S_1, \end{cases}$$

and

$$y_i = \begin{cases} 1 & \text{if } u_i \text{ is classified (by the matching process) in } S_1, \\ 0 & \text{if } u_i \text{ is not classified in } S_1. \end{cases}$$


Model for Correlated Error 


Assume: (1) $y_i$ is a random variable with $P(y_i = 1 \mid \eta_i = 0) = \phi$ and $P(y_i = 0 \mid \eta_i = 1) = \theta$,
and (2) $y_i$ and $y_j$ are independent given $\eta_i$ and $\eta_j$ for $i \neq j$. Let $E(\cdot \mid S)$ and $V(\cdot \mid S)$ denote
conditional expectation and variance, respectively, given S. Then $\hat{p}_{1|1} = \sum_i y_i / n$ and

$$E(\hat{p}_{1|1} \mid S) = (1-\theta)\, p_{1|1} + \phi\,(1 - p_{1|1}),$$

where $p_{1|1} = \sum_i \eta_i / n$. Taking expectation with respect to S yields the result in (5).

Further, $V(y_i \mid \eta_i = 0) = \phi(1-\phi)$ and $V(y_i \mid \eta_i = 1) = \theta(1-\theta)$. Therefore,

$$V(\hat{p}_{1|1} \mid S) = \phi(1-\phi)\,(1 - p_{1|1})/n + \theta(1-\theta)\, p_{1|1}/n.$$

Taking expectation with respect to S yields SMV in (8).

Finally, combining $VE(\hat{p}_{1|1} \mid S)$ and $EV(\hat{p}_{1|1} \mid S)$ yields the result in (6).
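
The conditional mean and variance above can be checked numerically. The following is a minimal sketch, not part of the original derivation: it codes the two closed-form expressions and compares them with a small Monte Carlo simulation in which each PES case is misclassified independently. The function names, the example error rates and the sample composition are all assumptions chosen for illustration.

```python
import random


def expected_match_rate(p, theta, phi):
    """Conditional mean of the observed match rate given the sample.

    p is the true matched proportion in the sample, theta the false
    negative probability, phi the false positive probability.
    """
    return (1 - theta) * p + phi * (1 - p)


def conditional_variance(p, theta, phi, n):
    """Conditional variance of the observed match rate given the sample,
    assuming independent classification errors across the n cases."""
    return (phi * (1 - phi) * (1 - p) + theta * (1 - theta) * p) / n


def simulate(eta, theta, phi, reps=5000, seed=1):
    """Monte Carlo mean and variance of the observed match rate.

    eta is a list of true match indicators (1 = truly in the census).
    """
    rng = random.Random(seed)
    n = len(eta)
    rates = []
    for _ in range(reps):
        matched = 0
        for e in eta:
            if e == 1:
                # A true match is kept with probability 1 - theta.
                matched += rng.random() >= theta
            else:
                # A true nonmatch is falsely matched with probability phi.
                matched += rng.random() < phi
        rates.append(matched / n)
    mean = sum(rates) / reps
    var = sum((r - mean) ** 2 for r in rates) / (reps - 1)
    return mean, var


# Hypothetical sample: 80 true matches and 20 true nonmatches.
eta = [1] * 80 + [0] * 20
print(expected_match_rate(0.8, 0.05, 0.10), conditional_variance(0.8, 0.05, 0.10, 100))
print(simulate(eta, 0.05, 0.10))
```

The simulated values should lie close to the closed-form ones; the agreement tightens as `reps` grows.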


Model for Clerical Error 


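Let $(i,j)$ denote the $j$th person in the $i$th clerk's assignment. Let $y_{ij}$ and $\eta_{ij}$ be defined in
analogy to $y_i$ and $\eta_i$. Assume (1) to (3) for the clerical error model. Let $E_2$, $V_2$, and $C_2$ denote
conditional expectation, variance, and covariance with respect to the clerk error distributions
holding the sample of clerks fixed. Let $E_1$, $V_1$, and $C_1$ denote the corresponding expectation,
variance and covariance with respect to the random selection of the k clerk parameter vectors,
as per assumption (3), holding the sample S fixed. Then

$$E_1 E_2(\hat{p}_{1|1} \mid S) = E_1\left\{ \sum_i \frac{(1-\theta_i)\, n_{1i}}{n} + \sum_i \frac{\phi_i\, n_{0i}}{n} \right\} = (1-\theta)\, p_{1|1} + \phi\,(1 - p_{1|1}), \tag{A.1}$$

where $n_{1i} = \sum_j \eta_{ij}$ and $n_{0i} = \sum_j (1 - \eta_{ij})$. Hence, (4) follows upon taking expectation
of (A.1) with respect to S.

Consider the variance of $\hat{p}_{1|1}$. We have $\mathrm{Var}(\hat{p}_{1|1}) = VE(\hat{p}_{1|1} \mid S) + EV(\hat{p}_{1|1} \mid S)$ where
$E(\hat{p}_{1|1} \mid S)$ is given by (A.1). Further,

$$n^2\, V(\hat{p}_{1|1} \mid S) = \sum_i \sum_j V(y_{ij} \mid S) + \sum_i \sum_{j \neq j'} \mathrm{Cov}(y_{ij}, y_{ij'} \mid S),$$

where $V(y_{ij} \mid S) = E_1 V_2(y_{ij}) + V_1 E_2(y_{ij})$ and $\mathrm{Cov}(y_{ij}, y_{ij'} \mid S) = C_1[E_2(y_{ij}), E_2(y_{ij'})]$, the term
$E_1 C_2(y_{ij}, y_{ij'})$ being zero. Since $E_2(y_{ij}) = \phi_i$ for $\eta_{ij} = 0$, and $E_2(y_{ij}) = 1-\theta_i$ for $\eta_{ij} = 1$, we
have $V_1 E_2(y_{ij}) = \sigma_\phi^2$ if $\eta_{ij} = 0$, and $= \sigma_\theta^2$ if $\eta_{ij} = 1$. Further, $V_2(y_{ij}) = \phi_i(1-\phi_i)$ for $\eta_{ij} = 0$
and $V_2(y_{ij}) = \theta_i(1-\theta_i)$ for $\eta_{ij} = 1$. Thus

$$E_1 V_2(y_{ij}) = \begin{cases} \phi(1-\phi) - \sigma_\phi^2 & \text{if } \eta_{ij} = 0, \\ \theta(1-\theta) - \sigma_\theta^2 & \text{if } \eta_{ij} = 1. \end{cases}$$

Similarly, it can be shown that, for $j \neq j'$,

$$C_1\{E_2(y_{ij}), E_2(y_{ij'})\} = \begin{cases} \sigma_\theta^2 & \text{if } (\eta_{ij}, \eta_{ij'}) = (1,1), \\ -C_1(\theta_i, \phi_i) & \text{if } (\eta_{ij}, \eta_{ij'}) = (1,0), \\ \sigma_\phi^2 & \text{if } (\eta_{ij}, \eta_{ij'}) = (0,0). \end{cases}$$

Therefore,

$$V(\hat{p}_{1|1} \mid S) = \frac{\sum_i m_i^2 - n}{n^2}\, CC + \frac{SMV}{n}. \tag{A.2}$$

Finally, combining (A.1) and (A.2) in the identity
$V(\hat{p}_{1|1}) = VE(\hat{p}_{1|1} \mid S) + EV(\hat{p}_{1|1} \mid S)$, we have

$$V(\hat{p}_{1|1}) = \frac{1}{n}(SV + SMV) + \frac{\sum_i m_i^2 - n}{n^2}\, CC. \tag{A.3}$$

If we further assume that $m_i = m$ for all $i$ we obtain the form in (11).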
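
To see what the equal-assignment assumption does to the correlated component, note that with k clerks each handling m cases, n = km and the multiplier of CC, (sum of squared assignment sizes minus n) divided by n squared, reduces to (m - 1)/n. The sketch below evaluates that multiplier and the total variance of (A.3) for arbitrary assignment sizes; the function names and the numerical inputs are hypothetical, and SV, SMV and CC are simply passed in as given quantities.

```python
def cc_multiplier(assignment_sizes):
    """Multiplier of CC in (A.3): (sum of m_i^2 - n) / n^2,
    where m_i is the size of clerk i's assignment and n = sum of m_i."""
    n = sum(assignment_sizes)
    return (sum(m * m for m in assignment_sizes) - n) / n ** 2


def total_variance(sv, smv, cc, assignment_sizes):
    """Variance of the estimated match rate as in (A.3),
    for supplied values of SV, SMV and CC."""
    n = sum(assignment_sizes)
    return (sv + smv) / n + cc_multiplier(assignment_sizes) * cc


# Ten clerks with 50 cases each: the multiplier equals (m - 1) / n.
equal = [50] * 10
print(cc_multiplier(equal), (50 - 1) / 500)

# The same 500 cases split unevenly give a larger multiplier.
unequal = [10] * 5 + [90] * 5
print(cc_multiplier(unequal))
```

For a fixed total workload, the multiplier is smallest when assignments are equal, so uneven clerk workloads increase the contribution of correlated error under this model.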
In This Issue


Eight papers in this issue deal with Census Coverage Error. These papers, together with the 
four papers on this topic that appeared in the June 1988 issue, provide the reader a good over- 
view of some of the latest methods available for dealing with census coverage error. A great deal 
of attention has recently been directed at this problem by both policy makers and statisticians. 
In many countries, studies are carried out during or following each census to measure coverage 
error. In Canada, the Reverse Record Check (RRC) is the most important study undertaken to 
measure undercoverage. A Post-Enumeration Survey (PES) is conducted in the United States 
and Australia. 


The papers by Burgess and Romaniuc deal with coverage problems in the Canadian Census 
of Population. Burgess describes the RRC methodology, and considers some of its limitations 
that lead to errors in estimates of undercoverage. Romaniuc, on the other hand, takes a 
demographic approach to the study of the accuracy of the census. The results obtained in this 
way are contrasted with those based on the RRC. In addition, Romaniuc looks at the quality 
of data for components of change (births, deaths, migration) used in the demographic approach. 


Choi, Steel and Skinner’s paper deals with the 1986 Australian PES. Like Romaniuc, the 
authors consider demographic estimates of under-enumeration. Based on their analysis, the 
authors conclude that PES-based adjustments should continue to be used in the 1991 Census, 
but emphasize that investigation of bias problems should continue. 


Cressie uses a model for undercount errors to investigate the adjustment of census counts. 
He considers synthetic estimation, Bayes and empirical Bayes approaches, and uses risk to com- 
pare estimators. A ‘‘usual empirical Bayes’’ estimator is found to have the smallest risk. Cressie 
notes that the results depend on the assumption that a sufficiently large number of households 
are chosen in the PES. 

The paper by Rubin, Schafer and Schenker on imputation for missing values in a PES 
also has a Bayesian flavour. The authors review the imputation methods discussed by Schenker 
in the previous issue of Survey Methodology. They propose two model-based methods, 
and conclude that the method that does not ignore the missing data mechanism is preferable. 
The authors caution that, although their approach looks promising, more work is needed. 

Fein and West present a systematic classification of the causes of undercount and conclude 
that partial household omission is the biggest contributor to the undercount. Methodological 
analysis of total error in the dual system estimator (an estimator that was examined by authors 
in the June 1988 issue) is discussed by Mulry and Spencer. Using a Bayesian approach, the authors 
combine the error components to obtain a final interval estimate of net undercount rate. 

Zaslavsky deals with the undercount problem by using block-level undercount estimates to 
reweight households in the block. An advantage of this approach is that the ‘‘character’’ of each 
block is preserved. The details of the method are interesting and will look familiar to readers 
acquainted with raking methods. 

The development of new computer systems designed to process large amounts of information 
is a topic of increasing interest to survey statisticians. Five of the papers in this issue describe 
Software Development related to survey methodology. 

Automated coding systems developed by central statistical agencies are described in two papers. 
Lorigny deals with the QUID system used at the Institut National de la Statistique et des Etudes 
Economiques. Wenzowski’s paper is a guide to the ACTR system, developed at Statistics Canada. 
Both QUID and ACTR are designed to handle any type of classification system efficiently. 
Readers will be interested in comparing the approaches taken in the two systems. Some
performance data are also given.

Mudryk describes a computer system for quality control currently used as part of Statistics 
Canada’s overall quality assurance program. The objectives of the system are both to exercise 
error prevention in survey processing operations and to reduce inspection levels progressively 
as the quality of processing improves and stabilizes. 

Deguire describes a system, designed to analyze the syntax of postal addresses, currently under 
development at Statistics Canada. The software produces address search keys consisting of stan- 
dardized address components that can be used during computerized matching operations such 
as those involved in the construction of a national Address Register. 

Emery describes SQL (Structured Query Language), the most popular query language 
associated with relational database management systems. The strengths and weaknesses of the 
language are highlighted. 

In the final paper in this issue, Nathan provides a comprehensive list of over 250 books, theses 
and papers dealing with randomized response. A subject classification is also included. 


The Editor 
Evaluation of Reverse Record Check
Estimates of Undercoverage in 
the Canadian Census of Population 


R.D. BURGESS¹


ABSTRACT 


Estimates of undercoverage in the Canadian Census of Population have been produced for each Census 
since 1961, using a Reverse Record Check method. The reliability of the estimates is important to how 
they are used to assess the quality of the Census data and to identify significant causes of coverage error. 
It is also critical to the development of methods and procedures to improve coverage for future Censuses. 
The purpose of this paper is to identify potential sources of error in the Reverse Record Check, which 
should be understood and addressed, where possible, in using this method to estimate coverage error. 


KEY WORDS: Matching; Mobility; Nonresponse bias; Response error; Reverse record check; Sampling 
error; Tracing. 


1. INTRODUCTION 


The Census of Canada is conducted every five years; the most recent was in 1986. Starting 
with the 1971 Census, the main data collection methodology has been self-enumeration: less
than 4% of the population are enumerated using the canvasser method. In geographic areas 
where self-enumeration is used, each dwelling is listed and a questionnaire dropped off by an 
enumerator just prior to Census Day (June 3 in 1981 and 1986). In larger urban areas the respon- 
dent household is asked to return the completed questionnaire by mail to the local supervisor 
of the enumeration. In rural areas and smaller urban areas the questionnaires are picked up 
by the enumerator. 

The enumerator is to perform basic checks of coverage and response quality for his/her 
assignment and follow up on missing and incomplete questionnaires. Supervisory checks and 
quality control of the enumerator’s work are also carried out. However, there is no indepen- 
dent and rigorous check of the listing of dwellings. Further, there is only limited opportunity 
to verify the number of persons listed on the questionnaire by the respondent household. 

Not unexpectedly there are overcoverage and undercoverage errors in the Census. Such errors 
are important because of the various uses of Census data: representation in the Parliament
of Canada is determined using Census population counts; various federal-provincial government
financial agreements incorporate formulae that have population count or distribution
as a factor (Statistics Canada 1983b). In turn the quality of estimates of coverage error is an 
important issue: for the use of Census data; in considering adjustment of population and 
dwelling counts to compensate for the coverage error; and in attempting to improve coverage 
quality for future Censuses by identifying significant causes or areas of coverage error. 

Since 1961, Statistics Canada has produced and published an estimate of undercoverage 
for each Census of Population. The method used to produce these estimates has been a Reverse 
Record Check (RRC) study which involves five general activities or stages: 
(i) frame preparation - identification of a set of nonoverlapping lists that together are to
cover the total population that should be enumerated in the Census;


(ii) sample design and selection - selection of a random sample of persons from the lists; 


(iii) tracing - determination of the address of usual place of residence on Census Day for 
each selected person (or verification that he/she died or emigrated prior to the Census); 


(iv) searching - review of Census returns to determine whether the selected person had been 
enumerated or missed in the Census; and 


(v) weighting and estimation - weighting up of sample results to produce an estimate of 
the number of persons missed in the Census. 


A more detailed description of this methodology can be found in Gosselin (1976) or Statistics
Canada (1984).


Other methodologies - post Census re-enumeration, demographic analysis and adminis- 
trative record checks - could also be used to estimate Census undercoverage. In the Canadian 
context, however, each of these methodologies would likely produce results less reliable than 
those of the RRC. Re-enumeration studies show a tendency to miss the same households or 
persons as the Census itself. Demographic methods are model-based and suffer from a lack 
of reliable emigration estimates, measure only change in net coverage between censuses, do 
not identify individual cases and causes of coverage error, and are weakened sub-nationally 
by error in internal migration estimates. Administrative record checks are limited by the absence 
of a national administrative system that either has more complete coverage than the Census 
or has coverage errors independent of Census coverage error - a condition that would allow 
an incomplete administrative file to be used. Even if such a complete system existed, its use 
would be another version of a reverse record check, unless it were completely up to date in 
coverage and addresses, as of Census Day. 


For these reasons the reverse record check has been the preferred methodology in Canada, 
though demographic analysis methods have been used for corroborative analysis. However, 
the RRC itself has deficiencies. The purpose of this paper is to describe some of the sources
of error or limitations in the RRC method, in the context of the Canadian Census of Popula- 
tion. In Section 2 aspects of the survey methodology of the RRC that can lead to error in the 
final results are reviewed. The results of some analysis of RRC estimates, in conjunction with 
data from other sources, have raised unresolved problems related to the use of RRC results 
in population estimation. These results are presented in Section 3. Some concluding remarks 
are given in Section 4. 


2. LIMITATIONS OF THE REVERSE RECORD 
CHECK METHODOLOGY 


A limitation, in the context of this paper, is anything that restricts the applicability of the 
Reverse Record Check estimates or the confidence with which they can be used. Limitations 
can arise because of: differences between what is conceptually required by users and what the 
RRC attempts to measure; shortfalls in the design of the Reverse Record Check in attempting 
to meet its objectives; or sampling, response and other errors. Some of these limitations might 
be eliminated or reduced through modification of specific aspects of the Reverse Record Check. 
Others will persist or, by their nature, cannot be addressed. 
2.1 Applicability of Reverse Record Check Estimates


The objective of the Reverse Record Check is to provide estimates, for each of the ten prov- 
inces, of undercoverage in the Census of Population. Net coverage error is not estimated and 
the Yukon and Northwest Territories are excluded from the study. 

The RRC estimates the proportion of the population missed in the Census - i.e., the pro- 
portion of the population that was not enumerated but should have been. Overcoverage (persons 
enumerated more than once, and persons enumerated who should not have been or were
fictitious) is not estimated by the RRC. Thus net coverage error, undercoverage minus
overcoverage, is not estimated by this vehicle. Even if the amount is small, the potential importance
of overcoverage lies in its size and distribution relative to undercoverage. For example, over- 
coverage of 0.2%, one tenth the level of undercoverage in 1976 and 1981, would be very impor- 
tant if the rate for a particular province is as high as 0.5%. 

The two Canadian territories have not been included in the RRC because the size of their 
populations is small but they have exceptionally high rates of intercensal in- and out-migration.
In terms of sampling error, to produce reliable estimates for the territories, a proportionally
large sample of the territorial population would have to be selected, of the order of
a 5% sample or 3,750 persons. The territories have in and out intercensal migration rates of 
a third or more. Therefore, 1,250 of the 3,750 persons (on average) in the minimum sample 
should be intercensal in-migrants, assuming a proportional sample is required. The RRC uses 
lists for which the address of residence for the majority of persons was obtained five years earlier 
and in-migrants to the territories can only be identified during the conduct of the study. This 
in itself is not a problem. However, the RRC uses only a 0.15% sample. The in-migrants to 
the territories, therefore, would be expected to be sampled at this latter rate and not at the 
required 5% rate. This would result in a sample of in-migrants to the territories of only 30 
persons. Thus, within the current framework of the RRC, and without prohibitive additional 
expense, it is not possible to select a meaningful sample to represent that third or more of the 
territorial population who are intercensal in-migrants. 


2.2 The Reverse Record Check Methodology 


Each of the five stages of the Reverse Record Check is a known or potential source of error. 


2.2.1 Frame 


The sample for the RRC is selected from four lists or frames: 


(i) Census: persons enumerated in the previous Census - for example, the 1981 Census 
was used for the 1986 Reverse Record Check; 


(ii) Birth: intercensal births, obtained from vital statistics records; 


(iii) Immigrant: intercensal immigrants, obtained from records of Employment and 
Immigration Canada; and 


(iv) Missed: persons missed in the previous Census - which is available as a sample only 
from the previous Reverse Record Check (no complete list exists for this group). 


These lists are intended to include or represent, without duplication of individuals on or 
between lists, all persons who should be enumerated (in one of the ten provinces) in the cur- 
rent Census. 

Some people, however, are not represented on these lists. Included among these are: (a) 
intercensal and never enumerated illegal aliens; (b) certain classifications of refugee; (c) certain 
Canadians "abroad" at the time of the previous Census who returned prior to the current
Census; (d) persons who move from the territories to one of the provinces in the intercensal 
period; and (e) persons not enumerated in any Census covered by the application of the RRC, 
but who were usual residents of Canada prior to 1961. 

It is assumed, without direct evidence, that the number of persons in category (e) has become 
small enough to be irrelevant. For the 1981 Census the size of category (d) was estimated to 
be of the order of 18,000 persons. Most of these persons were usual residents of the territories 
at the time of the previous (1976) Census. There were probably also a few of what would be 
Birth frame and Immigrant frame persons among the 18,000. 

Category (c) includes some Canadians working, studying or travelling abroad who did not 
maintain a usual place of residence in Canada during their absence and may also include children 
born outside Canada to parents in this category. It does not include persons in the Canadian 
military, in External Affairs or other government service (and their families) living abroad. 
They are included in the Census frame and the Missed frame. For the 1981 Census, the size 
of this returning ‘‘abroad’’ group was estimated to be approximately 67,000 persons. 

Refugee applicants and illegal aliens in Canada are to be enumerated in the Census, assuming 
they do not have a usual place of residence outside of Canada, and are not holders of work 
or student visas. For the 1981 and 1986 RRC studies, persons applying from abroad and entering 
Canada as refugees were included in the Immigrant frame. Persons applying within Canada 
were included in the Immigrant frame only if they had been granted refugee status. As of April 
1985, there were 12,500 applications from within Canada under consideration (Plaut 1985).
The number of illegal aliens in Canada is not known or reliably estimated. Some illegal aliens
may be represented in the Census frame or even the Missed frame. Amnesty programmes in
the 1970s and 1980s will have resulted in some illegal aliens being entered in the Immigrant frame.

Under the current RRC methodology the exclusions to the frames are important to the extent 
that such persons are not counted in the current Census. Since the Immigrant frame tends to 
have a high undercoverage rate (8.5% compared to 2.0% overall in 1981), it is not unreasonable 
to expect a high undercoverage rate for the refugee status claimants. It is possible that the 
majority of illegal aliens were not counted in the Census. These elements of undercoverage 
could be significant relative to the estimated number of persons missed (approximately 500,000 
in 1981). The refugee status claimants and the illegal aliens may have been clustered in a few 
urban centres within only certain provinces. This would increase the impact of such exclusions 
on the reliability of estimates. 

The lists can also be expected to include some amount of overcoverage; e.g., persons 
enumerated in the previous Census who should not have been or who were enumerated more 
than once, fictitious persons and processing errors. Some overcoverage is detected during the 
course of the RRC operations. In estimating undercoverage, however, the effect of overcoverage 
in the frames would be consequential only if it approaches or exceeds the undercoverage in 
the Census in size. 


2.2.2 Sample Size and Design 


Error due to sampling is a major limitation of the RRC results. While the potential size of 
this error is dependent upon sample size and design, the sample size is the more important ele- 
ment. It, along with the available lists, limits the design options. 

The basic 1981 and 1976 RRC undercoverage estimates for provinces and their correspon- 
ding estimates of standard error are presented in Table 1. The coefficients of variation (stan- 
dard error divided by estimated undercoverage) varied from 4.5% at the Canada (10 provinces) 


Survey Methodology, December 1988 141 


Table 1

Estimated Population Undercoverage in the 1981 and 1976 Census,
by Province, showing Provinces with Significant Differences in
Population Undercoverage (with 95% confidence)

                              Population Undercoverage    Provinces with a
                                                          Significantly Different
Province                      Rate (%)      S.E. (%)      Undercoverage Rate

1981 Census

Canada (10 Provinces)           2.01          0.09
 1. Newfoundland                1.74          0.45        10
 2. Prince Edward Island     [illegible]      0.54        9 and 10
 3. Nova Scotia                 1.05          0.34        5, 6, 9 and 10
 4. New Brunswick               1.81          0.30        10
 5. Québec                      1.91          0.21        3, 7, 8 and 10
 6. Ontario                     1.94          0.14        3, 7, 8 and 10
 7. Manitoba                    0.98          0.35        5, 6, 9 and 10
 8. Saskatchewan                0.99          0.37        5, 6, 9 and 10
 9. Alberta                  [illegible]      0.36        2, 3, 7 and 8
10. British Columbia            3.16          0.33        all but 9

1976 Census

Canada (10 Provinces)           2.04          0.10
 1. Newfoundland                1.10          0.39        5 and 10
 2. Prince Edward Island        0.38          0.25        4, 5, 6, 8, 9 and 10
 3. Nova Scotia                 0.86          0.34        4, 5 and 10
 4. New Brunswick               2.16          0.37        2, 3, 7 and 10
 5. Québec                      2.95          0.25        1, 2, 3, 6, 7, 8 and 9
 6. Ontario                     1.52          0.17        2, 5 and 10
 7. Manitoba                    1.07          0.33        4, 5 and 10
 8. Saskatchewan                1.33          0.34        2, 5 and 10
 9. Alberta                     1.49          0.26        2, 5 and 10
10. British Columbia         [illegible]      0.31        all but 5


level, up to 13.6% at the regional (Atlantic, Québec, Ontario, Prairie and British Columbia) 
level and up to 46% at the provincial level. Sub-provincial coefficients of variation were typically 
higher. For an Electoral District of average size (86,323 persons in 1981) with an estimated 
2% undercoverage, the coefficient of variation would be approximately 50%. For smaller 
geographic areas and small population groups the coefficient of variation could be much higher. 
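
The coefficient of variation used here is the standard error divided by the estimated undercoverage rate. As a small illustration, the sketch below recomputes it for a few of the 1981 entries of Table 1 that are legible in this copy; the function name and the dictionary layout are hypothetical.

```python
def coefficient_of_variation(rate, standard_error):
    """Standard error divided by the estimated undercoverage rate."""
    return standard_error / rate


# (rate %, standard error %) for selected 1981 entries of Table 1.
estimates_1981 = {
    "Canada (10 provinces)": (2.01, 0.09),
    "Newfoundland": (1.74, 0.45),
    "Ontario": (1.94, 0.14),
    "British Columbia": (3.16, 0.33),
}

for area, (rate, se) in estimates_1981.items():
    cv = coefficient_of_variation(rate, se)
    print(f"{area}: CV = {100 * cv:.1f}%")
```

The Canada-level figure comes out at about 4.5%, matching the value quoted in the text, while the smaller provinces show much larger values.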

The sampling error, of course, has an effect on attempts to differentiate among provincial, 
and among other undercoverage rates. In turn this affects attempts to identify specific causes 
or areas of undercoverage, and undermines the validity of adjusting for coverage error as a 
means to improve Census counts. Those provinces with a significantly different undercoverage 
rate are also shown in Table 1. The undercoverage rates for the provinces appear to fall into 
six groupings, for 1981, based on both rate of undercoverage and provinces with which the 
rate is significantly different. For 1976, with eight groups, there was less similarity between 
provinces. No group in either Census, however, can be shown to be completely different from 
all others, and may not be. 
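
The paper does not state which test underlies the "significantly different" column of Table 1. One plausible reading, offered here only as an assumption, is a two-sided comparison at the 95% level that treats the provincial estimates as independent and approximately normal. The sketch below applies that rule to a few 1981 pairs; the function name and the 1.96 critical value are choices made for this illustration.

```python
import math


def significantly_different(rate_a, se_a, rate_b, se_b, z_crit=1.96):
    """Two-sided test of equal undercoverage rates.

    Assumes the two estimates are independent and approximately normal
    (an assumption of this sketch, not a statement from the paper).
    """
    z = (rate_a - rate_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(z) > z_crit


# 1981 rates and standard errors (%) taken from Table 1.
newfoundland = (1.74, 0.45)
ontario = (1.94, 0.14)
british_columbia = (3.16, 0.33)

print(significantly_different(*ontario, *british_columbia))       # Ontario vs B.C.
print(significantly_different(*newfoundland, *british_columbia))  # Newfoundland vs B.C.
print(significantly_different(*newfoundland, *ontario))           # Newfoundland vs Ontario
```

Under these assumptions, the three comparisons agree with the corresponding entries in the last column of Table 1: British Columbia differs from both Ontario and Newfoundland, while Newfoundland and Ontario cannot be distinguished.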
This general situation is not dissimilar to that for applications of the Reverse Record Check
for the 1966 and 1971 Censuses. From 1966 onward only the province of British Columbia 
has had an undercoverage rate significantly above the Canada level. The variation from Census 
to Census for most provinces, in large part, could be due to sampling error. Why it is not for 
British Columbia is a major concern for both the Reverse Record Check and the Census. 

The need to use a sample of ‘‘missed’’ persons from the previous RRC also places a limita- 
tion on the design and sample size. There is no direct control of the size of this segment of 
the sample. Any limitations of the previous Reverse Record Check, to the degree that these
were reflected in the estimate of ‘‘missed’’ persons, will be passed on (see Sections 2.2.4, 2.2.5
and 3).


2.2.3 Tracing 


Given the nature of the lists or frames used for sample selection, addresses and other infor- 
mation may be up to five years out of date. Attempts are made to update addresses prior to 
Census Day using administrative files. (This was first carried out extensively for the 1986 RRC.)
After Census Day, the Census questionnaire corresponding to the original address, or the update 
if available, is searched as a first attempt to determine whether the selected person was 
enumerated in the Census. Every selected person not found enumerated in the first search must 
be traced. The selected person, or a reliable source, must be contacted either to obtain an 
updated or confirmed address, or to determine the selected person’s status, i.e., as deceased,
emigrated, abroad. 

Despite extensive tracing activities, not all selected persons can be traced. This may result 
in a form of nonresponse bias. In the 1981 RRC 3.4% of all selected persons were not traced. 
With overall undercoverage in the Census estimated to be 2.0% this ‘‘not traced’’ rate represents 
an important uncertainty in the RRC estimates. 

A weight adjustment is carried out to account for these ‘‘not traced’’ cases. The effect of 
the weight adjustment for the 1981 Census was to impute an undercoverage rate of 3.27% for 
the ‘‘not traced’’ cases from the Census and Missed frames (jointly), 1.46% for the Birth frame 
and 11.94% for the Immigrant frame. Overall, the proportion of ‘‘not traced’’ weights 
‘‘imputed’’ by the weight adjustment to ‘‘missed’’ was 1.6 times the initial (weighted) proportion
represented by the ‘‘missed’’ cases among all traced selected persons. This suggests a relation- 
ship between ‘‘not traced’’ and ‘‘missed’’. It is not known, of course, if the 1.6 rate was too 
high, too low or correct. To the extent that it is not correct, there may be some distortion in 
provincial estimates of undercoverage as well as a bias in overall estimates of undercoverage. 

Since the rates of intercensal interprovincial in- and out-migration vary from one province
to another, there may be some distortion among provincial estimates. This will occur if the 
proportion of interprovincial movers within weighting groups is not the same among the cases 
traced and not traced. 

Intercensal interprovincial movers (applicable for Census and Missed frames only) have a
high undercoverage rate. This rate was estimated to be 6.13% for the 1981 Census, based upon
mobility data from the 1981 RRC derived by comparing the 1976 Census and 1981 Census
addresses. The estimated undercoverage rate for intercensal migrants within a province (i.e., 
between Census Subdivision (CSD) or municipality movers) was 3.83%. For intercensal non- 
migrant movers (within CSD or municipality) the undercoverage rate was estimated to be 
2.83%. Given these rates and the distribution of mobility characteristics, the ‘‘imputed’’ under-
coverage rate for the ‘‘not traced’’ cases from the Census and Missed frames put together would
be expected to be at least 3.52% rather than the actual 3.27%. This expectation rests on the fact
that persons not traced have almost always moved. It is, in turn, assumed that these ‘‘not traced’’
cases included proportionally at least as many migrants, within and between provinces, and had
undercoverage rates, by mobility status, no lower than those of traced cases. (The distribution of mobility
status of the enumerated population 5 years and older, estimated through the 1981 RRC was 
approximately: (i) Non-movers - 55%; (ii) Non-migrant Movers - 17%; (iii) Migrants Same
Province - 21.7%; (iv) Migrants Different Province - 5%; and (v) Migrants From Outside 
Canada - 2%.) 

Given the tracing methods used, it is not unreasonable to speculate that the proportion of 
migrants, and thus the undercoverage rate, was much higher for the ‘‘not traced’’ cases. If 
so, then there could be a significant downward bias in the estimates of undercoverage.
For example, if the ‘‘true’’ undercoverage rate among the cases not traced was close to 5.0%, 
then the bias in the undercoverage estimate at the Canada (10 provinces) level would exceed 
the sampling error. 


2.2.4 Searching and Classification 


After all tracing attempts have been made and any interviews conducted, each selected person 
is classified to one of six categories: 

(1) enumerated; 

(2) missed; 

(3) deceased; 

(4) emigrated or abroad; 

(5) overcoverage in a list or frame; and 

(6) not traced. 


As outlined above, to determine whether a selected person has been enumerated or missed 
the Census questionnaire corresponding to the selected person’s address must be searched. For 
the search to result in the correct classification of the selected person, it is necessary that the 
address being searched be the correct address, and that the selected person be correctly iden-
tified on the Census questionnaire and in RRC documentation; i.e., that there be no response
error or nonresponse for the relevant items. 

If the selected person is correctly identified (complete name, correct age and sex, etc.) and
there are no processing errors, then no selected person who was missed in the Census will be 
classified as ‘‘enumerated’’. The converse is not true. If a selected person has been enumerated 
in the Census at some address other than that which is obtained from the list of selection, some 
other administrative source or a directory, then to be classified as ‘‘enumerated’’ that address
must be provided by the selected person or some other contact. If the selected person does not
or cannot provide that address (for example, because of recall error), then he or
she will be classified as ‘‘missed’’ or ‘‘not traced’’. Generally, when the selected person (or 
a parent, spouse or other reliable source) gives an address, or set of addresses, where he/she 
should have been or may have been enumerated, this address information is accepted as cor- 
rect. Selected persons will be classified as ‘‘enumerated’’ or ‘‘missed’’ based on this address 
information. It is not known how accurate such address information actually is for persons 
classified as ‘‘missed’’.

On the other hand there may be a higher probability of classifying a person missed as ‘‘not 
traced’’ than a person enumerated in the Census. Before a person can be classified as missed 
he/she (or a reliable source) must be interviewed to confirm the address and to obtain possible 
alternative addresses and certain Census data for him/her and the household. This procedure 
will eliminate some classification error. At the same time, if the information about a person 
missed is doubted this can only be resolved through the contact with him/her (or a parent, 
spouse, etc.). If the doubt is not resolved the case will be classified as ‘‘not traced’’. Conclusive
information is not always necessary for a person who was enumerated. With exhaustive searching
it may be possible to reclassify a selected person who was enumerated from ‘‘not
traced’’ to ‘‘enumerated’’, even if the address obtained is incomplete or incorrect. Such searching
is much less likely to alter the outcome for persons missed in the Census.

The selected person is not always adequately identified. When a selected person is accepted as
matched (i.e., found enumerated on a Census questionnaire), the name is not always identical on
the Census and RRC documents. Sometimes only the first person listed on a Census questionnaire
has a complete name and in a few cases no names are given. If the identity of the selected
person cannot be determined from the list or frame, then the case will be classified as ‘‘not 
traced’’ at the outset. Included among these will be persons ‘‘assigned’’ for absent households 
and refusals in the previous Census. Date of birth and other data are not always present, com- 
plete or found identical in matching. For the majority of cases the quality of matching is unques- 
tioned, but a minority of cases raise doubts. Doubtful cases accepted as matched potentially 
are misclassified as ‘‘enumerated’’. Those rejected as matched potentially are misclassified as 
‘‘missed’’, though most will be classified as ‘‘not traced’’. Different rules for acceptance/rejection
as matched, of course, may yield different estimates of undercoverage.

Some overcoverage in the frames can be detected. This will include: some foreign residents 
enumerated in the previous Census; persons ‘‘created’’ by processing error in the previous 
Census; immigrants who have not yet resided in Canada; births in Canada to non-resident 
parents; and fictitious or out of scope ‘‘persons’’ listed on the questionnaire from the previous 
Census. In 1981 these cases represented less than 0.1% of selected persons. 

Overcoverage in the form of duplication in a frame will not be detected. Fictitious selected 
persons may go undetected and be classified among the ‘‘not traced’’ cases.

The final classifications of the selected persons from the 1981 RRC are presented in Table 
2 (from Burgess 1986). 


2.2.5 Weighting and Estimation 


At the time of sample selection, a basic weight equal to the inverse of the sampling fraction 
is assigned to each selected person record. Two types of weight adjustment are made to this 
basic weight - one to account for ‘‘not traced’’ cases, the other to account for deviations in 


Table 2
1981 RRC Final Classification of Selected Persons

                                                      Frame
Final Classification      Census           Birth            Immigrant        Missed           Total
                          Cases      %     Cases      %     Cases      %     Cases      %     Cases      %

Traced                    29,761   97.1    3,211    92.3    1,392    96.1      807    96.1    35,171   96.6
  Enumerated              27,541   89.8    3,096    89.0    1,113    76.8      696    82.9    32,446   89.1
  Deceased                 1,056    3.4       33     0.9        5     0.3       26     3.1     1,120    3.1
  Emigrated/Abroad           299    1.0       34     1.0      111     7.7       24     2.8       468    1.3
  Missed                     865    2.8       48     1.4      163    11.2       61     7.3     1,137    3.1

Not Traced (incl.
  Overcoverage)              895    2.9      267     7.7       57     3.9       33     3.9     1,252    3.4

TOTAL                     30,656  100.0    3,478   100.0    1,449   100.0      840   100.0    36,423  100.0

the representativeness of the sample, after elimination of ‘‘not traced’’ cases, relative to the
lists of selection. 

A ‘‘not traced’’ case represents a person enumerated or missed in the Census, a deceased
person, an emigrant, a person abroad or overcoverage. The weights of the ‘‘not traced’’ cases,
therefore, are redistributed among the ‘‘traced’’ cases. The adjustment is carried out within 
groups defined by various demographic and geographic characteristics, and frame. 

The weight adjustment for the ‘‘not traced’’ cases is carried out in two stages. First, an adjust- 
ment is made for those cases for which no tracing was undertaken because there was inade- 
quate information for matching and tracing. These cases are weighted into all other selected 
persons. Second, an adjustment is made for all other ‘‘not traced’’ cases. These are weighted 
into specific groups of the remaining selected persons. How the ‘‘not traced’’ adjustment is
carried out is restricted by the information available on the ‘‘not traced’’ selected persons. 
Ideally, how a selected person was traced and whether he/she had moved and how far, as well 
as demographic characteristics, should be taken into consideration in defining weighting groups. 
To date only demographic characteristics and minimal mobility data have been used in the 
weight adjustment. (Persons selected in the Census frame who have not moved in the intercensal 
period and who were classified as ‘‘enumerated’’ are excluded from this weight adjustment.)
By their nature it is difficult to categorize most ‘‘not traced’’ cases beyond the fact that they
were not found enumerated at the address given on the list of selection. 
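
The redistribution of ‘‘not traced’’ weights described above can be sketched as follows. This is a minimal illustration only, not the RRC production procedure: the record layout (`group`, `weight`, `traced`) and the function name are hypothetical, a single adjustment stage is shown rather than the two stages described, and the exclusion of enumerated non-movers from the Census frame is not modelled.

```python
from collections import defaultdict


def redistribute_not_traced(records):
    """Spread the weight of not-traced cases over traced cases.

    records: list of dicts with hypothetical keys
        'group'  - weighting group label (demographic/geographic/frame)
        'weight' - basic sampling weight
        'traced' - True if the selected person was traced
    Returns the traced records with weights inflated so that each
    weighting group keeps its original total weight.
    """
    total = defaultdict(float)
    traced = defaultdict(float)
    for r in records:
        total[r["group"]] += r["weight"]
        if r["traced"]:
            traced[r["group"]] += r["weight"]

    adjusted = []
    for r in records:
        if not r["traced"]:
            continue  # weight is carried by the traced cases of the group
        if traced[r["group"]] == 0:
            raise ValueError(f"no traced cases in group {r['group']!r}")
        factor = total[r["group"]] / traced[r["group"]]
        adjusted.append({**r, "weight": r["weight"] * factor})
    return adjusted
```

Because the inflation factor is constant within a group, the adjustment implicitly assigns to the ‘‘not traced’’ cases the same distribution of final classifications as the traced cases of that group, which is why the choice of weighting groups matters.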

For the second type of adjustment, totals for relevant sub-groups of the population are 
obtained from each frame (except for the Missed frame for which only a sample is available). 
Using these ‘‘known totals’’, an adjustment to the RRC weights is made within the correspon- 
ding subgroups of the sample. This is done to reduce the error in the estimates by ensuring 
that totals from the sample, for basic population characteristics for which undercoverage rates 
are published, correspond to the totals in the frames. 
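
The second adjustment is a form of ratio adjustment to known totals. A minimal sketch is given below, assuming each record carries a hypothetical `subgroup` label and that `known_totals` holds the frame count for each subgroup; the actual subgroups and any iteration over several characteristics are not specified here.

```python
from collections import defaultdict


def adjust_to_known_totals(records, known_totals):
    """Scale weights so weighted sample totals match known frame totals.

    records: list of dicts with hypothetical keys 'subgroup' and 'weight'.
    known_totals: dict mapping subgroup label -> known frame total.
    """
    sample_totals = defaultdict(float)
    for r in records:
        sample_totals[r["subgroup"]] += r["weight"]

    adjusted = []
    for r in records:
        # One ratio per subgroup: known total over weighted sample total.
        factor = known_totals[r["subgroup"]] / sample_totals[r["subgroup"]]
        adjusted.append({**r, "weight": r["weight"] * factor})
    return adjusted
```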

Neither adjustment deals at all with the various exclusions to the lists used for sample selec- 
tion. In the calculation of any proportion of persons missed in the Census the published Census 
count of enumerated persons is used in the denominator in order to minimize sampling error. 
(The covariance of the estimate of ‘‘enumerated’’ persons and the estimate of ‘‘missed’’ persons 
tends to be negative.) Since the RRC does not represent all elements of the true population, 
the effect of using the Census count is to assume that the undercoverage rate for the exclu- 
sions is zero. 

The estimator, which takes the general form

\[
\text{Estimated proportion of persons missed} =
\frac{\text{Estimated no. of missed persons}}
     {\text{No. of persons counted in the Census} + \text{Estimated no. of missed persons}}
\]

is discussed further in Appendix 2.
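
As a worked illustration of this estimator, the sketch below applies it to the Canada and British Columbia figures reported in Table 3 (published Census count and RRC estimate of persons missed). The function name is illustrative only.

```python
def proportion_missed(census_count, estimated_missed):
    """Estimated proportion of persons missed in the Census.

    The published Census count, not the RRC estimate of enumerated
    persons, is used in the denominator.
    """
    return estimated_missed / (census_count + estimated_missed)


# Figures taken from Table 3 (1981 Census).
canada = proportion_missed(24_274_287, 497_277)
british_columbia = proportion_missed(2_744_467, 89_445)

print(f"Canada (10 provinces): {100 * canada:.2f}%")        # about 2.01%
print(f"British Columbia:      {100 * british_columbia:.2f}%")  # about 3.16%
```

The Canada value agrees with the overall undercoverage of roughly 2% quoted in the text.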


2.3 Reducing Potential for Error and Methodological Limitations 


Experimental work and evaluation of methods in the RRC may make it possible to elimi- 
nate or reduce the impact of some sources of error or limitations. 

Overcoverage might be estimated by means of an independent study. Such a study is being 
conducted, on an experimental basis, for the 1986 Census. However, the cost to produce 
estimates of adequate quality at the province level may be very high. 

The production of estimates for the Yukon and Northwest Territories requires a set of lists 
other than those used for the RRC. Such a set would have to be current and have no signifi- 
cant duplication that could not be removed or estimated. With such a set of lists, the basic 
RRC methods could be applied. Some experimental work in this regard has been done and
more is planned. 


The lists used for the RRC could be augmented to eliminate some of the exclusions, for 
example, refugee status claimants and migrants from the territories to the provinces. These 
people, however, will be difficult to trace. Sampling these groups may do little more than change 
the nature of the problem. 


A sample of ‘‘abroad’’ persons could be obtained by using the previous Reverse Record 
Check. Such a sample, however, would be very small, would not represent the entire group 
in question and the selected persons would be difficult to trace. 


Other than illegal aliens the ‘‘never enumerated’’ group will become smaller and smaller
over time. Intercensal illegal aliens, and other illegal aliens never enumerated in Canada, will 
remain excluded. 


The impact of sampling error can be reduced by increasing the sample size. The question 
is to what size, at what cost, based upon what criteria? An increase in the RRC sample from 
its current 36,500 persons to 100,000 should be sufficient to bring the provincial standard error 
estimates, for the undercoverage rates, down below 0.2%. However, this may not be suffi- 
cient for purposes of adjusting the Census counts, depending upon the level and distribution 
of undercoverage estimates actually obtained. A reduction of the standard error to 0.1% for 
each province - the level yielded by the 1981 and 1976 RRC studies for the Canada (10 prov- 
ince) level estimate of undercoverage of 2% - would require a sample for Canada of approx- 
imately 350,000 persons, assuming the 1981 provincial levels of undercoverage, type of sample 
design and design effects. To conduct a high quality RRC operation for such a large sample, 
given the controls and quality checks required, would be much more costly than the mere 
increase in sample size suggests, and might be operationally unrealizable. Increasing the sample 
size, of course, would not reduce any bias in the estimates. 
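
A rough sense of the trade-off between sample size and standard error can be had from the usual inverse-square-root relationship. The sketch below assumes that the sample design, the allocation among provinces and the design effects all stay fixed as the sample grows; the sample sizes quoted above may rest on a different allocation, so the sketch is not expected to reproduce them exactly. The function names are illustrative only.

```python
import math


def scaled_standard_error(se_current, n_current, n_new):
    """Standard error after changing the sample size, all else fixed."""
    return se_current * math.sqrt(n_current / n_new)


def required_sample_size(se_current, n_current, se_target):
    """Sample size needed to reach a target standard error, all else fixed."""
    return math.ceil(n_current * (se_current / se_target) ** 2)
```

Under these assumptions, halving a standard error requires roughly four times the sample, which is why the move from a 0.2% to a 0.1% provincial target is so much more expensive than the first increase.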


Tracing methods are examined before and after each RRC. Major changes were made for 
1986 and changes and improvements are being contemplated for 1991. It must be expected, 
however, that there will again be a non-negligible percentage of ‘‘not traced’’ cases. These cases 
will continue to be dealt with by weighting or by imputation and weighting. 


Evaluative studies can be conducted to assess the quality of matching and of address infor- 
mation provided by respondents or reliable sources. The potential impact of the matching 
algorithm or criteria can also be assessed to some extent. However, even if such studies iden- 
tify a problem, solutions may not be readily forthcoming. 


Modifications to the weighting procedures can be tested in an attempt to better deal with 
mobility and other characteristics when adjusting for ‘‘not traced’’ cases (Burgess 1986). Addi- 
tional information for this purpose might be available from administrative sources. Some minor 
refinements using existing information can also be made. For example, the adjustment for ‘‘not 
traced’’ persons contacted, but from whom the necessary Census Day address information 
could not be obtained, might be different from that for ‘‘not traced’’ persons who potentially 
may be ‘‘deceased’’, ‘‘emigrated’’ or ‘‘abroad’’.


Adjustments using current Census totals of enumerated persons could be tested as well. For 
this to reduce any bias associated with ‘‘not traced’’ cases and persons not represented in the 
RRC sample, however, the basic classification of cases to ‘‘missed’’ must be without bias and 
there must be no interprovincial distortion of the proportion ‘‘missed’’. These types of 
modifications to the weighting would not in themselves eliminate bias. 


3. ANALYSIS OF REVERSE RECORD CHECK RESULTS


The RRC not only provides estimates of the number of persons missed in the Census, but 
also independent estimates of the number of persons enumerated in the Census, and the number 
of intercensal deaths, emigrants and persons who have moved abroad but who have not
emigrated. These estimates are used in validating RRC estimates. Some of the results of this 
validation process serve to illustrate limitations discussed in Section 2. 

Analysis has also been carried out to correlate geographic variation in undercoverage with
variation in the distribution of Census population and household characteristics.


3.1 Independent Estimates 


The Reverse Record Check estimates of persons enumerated in the Census, of intercensal
deaths, and of persons leaving Canada in the intercensal period can be compared to estimates 
from other appropriately chosen sources - for example, estimates of enumerated persons to 
Census counts and estimates of deaths to Vital Statistics data. If there are no significant biases 
in the RRC estimates, then any differences between these estimates will usually be explainable 
by the corresponding sampling error of the RRC estimate. If there are significant differences, 
then these might be due to biases in the RRC estimates. The overall quality of these estimates, 
revealed by the comparisons, likely will be a reflection of the quality of the estimates of 
‘‘missed’’ persons.

RRC estimates of emigrants (296,727) and of persons ‘‘abroad’’ (57,909) compared 
favourably with estimates based upon demographic analysis. The RRC estimate for emigrants, 
for example, is in the mid range of the five demographic analysis values examined - ranging 
from 197,000 to 372,000, with a mean value of 266,400. The RRC estimate of deceased persons 
(846,378) is very close to the value (840,689) published by Statistics Canada for 1976 to 1981.

Comparisons of estimates for enumerated persons do indicate some problems. Some of these 
comparisons are presented in Table 3. For Canada (10 provinces) and for two of the ten pro- 
vinces, the number of persons enumerated in the Census, as estimated by the RRC, is 
significantly different from the published Census count. The discrepancy of 209,911 at the 
aggregate level can be explained in part by exclusions from the lists or frames of the RRC. The 
discrepancies among provinces are difficult to explain. That in particular makes the discrepancy
important. The 209,911 aggregate discrepancy must be considered in the context of the RRC 
estimate of 497,277 persons missed in the Census; similarly, the discrepancy for British Col- 
umbia of 80,304 in the context of an estimated 89,445 persons missed and the discrepancy for 
Alberta of 86,244 persons in the context of an estimated 58,335 persons missed. 

An estimated 67,000 non-immigrants who had been ‘‘abroad’’ at the time of the previous 
Census arrived in Canada legally, and an estimated 18,000 persons moved from the territories 
to a province in the intercensal period. Assuming none of these people was missed in the Census, 
the discrepancy would be reduced to approximately 125,000 persons. This difference would 
remain at the outer limits of what would be reasonably accepted as due to sampling error only. 
Further, all of these 85,000 (67,000 + 18,000) persons would have had to have moved to 
Alberta and British Columbia to reduce the discrepancies for these provinces to within 95% 
confidence intervals - a clearly unreasonable supposition. 
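
The comparison underlying this discussion can be sketched by expressing each discrepancy (RRC estimate minus Census count) in units of the standard error of the RRC estimate, using figures from Table 3. The dictionary below holds only a subset of rows; the Alberta Census count is taken as the RRC estimate minus the printed difference (2,151,480 + 86,244), which is an assumption made for this illustration, and the three-standard-error threshold follows the table footnote.

```python
# (RRC estimate of persons enumerated, standard error, published Census count)
TABLE_3 = {
    "Canada (10 provinces)": (24_064_376, 62_193, 24_274_287),
    "Québec": (6_410_662, 38_648, 6_438_403),
    "Ontario": (8_629_374, 52_802, 8_625_107),
    "Alberta": (2_151_480, 24_238, 2_237_724),  # Census count derived, see text
    "British Columbia": (2_664_163, 19_798, 2_744_467),
}


def discrepancy_in_se(rrc_estimate, standard_error, census_count):
    """Difference RRC - Census expressed in standard errors of the RRC estimate."""
    return (rrc_estimate - census_count) / standard_error


def flag_large_discrepancies(table, threshold=3.0):
    """Return the areas whose discrepancy exceeds the threshold in absolute value."""
    return [
        area
        for area, (rrc, se, census) in table.items()
        if abs(discrepancy_in_se(rrc, se, census)) > threshold
    ]


for area, (rrc, se, census) in TABLE_3.items():
    print(f"{area:24s} {discrepancy_in_se(rrc, se, census):6.2f}")
```

By the same measure, the reduced discrepancy of about 125,000 persons corresponds to roughly two standard errors of the Canada-level estimate, consistent with the description of it as being at the outer limits of what sampling error alone would explain.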

The remainder of the difference (125,000) could be made up of various (potential) errors 
in the RRC or the Census: (i) sampling error in the RRC estimate of enumerated persons; (ii) 
an increase in overcoverage in the 1981 Census - compared with the 1976 Census; (iii) RRC 
exclusion of illegal aliens and refugee claimants enumerated in the Census itself; (iv) 
underestimation of persons missed in the 1976 Census - these persons make up the 1981 Missed

Table 3

Reverse Record Check Estimates of the Number of Persons
Enumerated in the 1981 Census by Province

                         RRC Estimate   Standard Error   Census        Persons        RRC Estimate
Province                 of Persons     of RRC           Published¹    Enumerated     of Persons
                         Enumerated     Estimate         Count         RRC-Census     Missed

Canada (10 provinces)    24,064,376     62,193           24,274,287    -209,911²      497,277
Newfoundland                568,696      8,256              567,681       1,015        10,039
Prince Edward Island        116,012      3,005              122,506      -6,494         1,456
Nova Scotia                 837,045     [illegible]         847,442     -10,397         9,034
New Brunswick               685,332      8,167              696,403     -11,071        12,864
Québec                    6,410,662     38,648            6,438,403     -27,736       125,180
Ontario                   8,629,374     52,802            8,625,107       4,267       171,010
Manitoba                  1,028,162     15,133            1,026,241       1,921        10,203
Saskatchewan                973,450     11,740              968,313       5,137         9,712
Alberta                   2,151,480     24,238            2,237,724     -86,244²       58,335
British Columbia          2,664,163     19,798            2,744,467     -80,304²       89,445

¹ Statistics Canada 1982.
² Greater than 3 standard errors.


frame; and/or (v) over-estimation of persons missed in the 1981 Census. The extent to which 
each of these sources might have contributed to the difference is not known. The fact that a 
large part of the difference seems to be associated with British Columbia and Alberta is per- 
haps in some degree due to under-estimation of intercensal migrants. Migration to these prov- 
inces was particularly high between 1976 and 1981 (Statistics Canada 1979; 1983a). 

There may also be some bias in the estimates of emigrated, abroad and/or deceased persons. 
If these are over-estimated for reasons other than ‘‘not traced’’ bias, there should also be a
tendency to under-estimate the persons missed, since the last address in Canada is sought and 
used in searching. Persons who emigrated, died or went abroad after Census Day may have 
been reported as such at the time of tracing, perhaps several months after Census Day. At the 
same time, the fact that deceased persons do not appear to have been under-estimated despite 
the exclusions to the RRC frames suggests a lower mortality rate for the exclusions (as is the 
case for immigrants - see Table 2) than for the entire population and/or over-estimation of 
this group. 

The data in Table 4 show that intercensal migrants were under-estimated for all provinces 
except Saskatchewan. This may be in part associated with the ‘‘not traced’’ cases. The under- 
estimation for British Columbia may explain the discrepancy for this province shown in Table 
3. On the other hand, the under-estimation for Alberta does not adequately explain the 
discrepancy for that province, and thus one or more of the factors (i) to (v) noted above must
be contributing to this discrepancy. 

Under-estimation of migrants may cause a distortion of undercoverage estimates among 
the provinces; i.e., the large differences shown in Table 4 by province might be indicative of
substantial biases in provincial under-enumeration rates. Further, as noted in Section 2.2.3, 
migrants have higher than average levels of undercoverage. If the enumerated persons within 
this group are under-estimated, while in general non-migrants are not under-estimated, relative 
to the Census, then estimates of undercoverage may be too low. 

Table 4
Reverse Record Check Estimates of Migrants¹ Enumerated
in the 1981 Census, by Province

                                     Migrants                       Census Estimate of Inter-
                                                                    Provincial Migration
Province                RRC          Census        Difference
                        estimate     published²    RRC-Census       In           Out           Out/In

Canada                  4,670,311    5,046,500     -376,239         1,124,970    1,122,870     -
Newfoundland               61,499       72,100      -10,601            18,430       38,265     2.08
Prince Edward Island       13,257       20,530       -7,273             9,945        9,950     1.00
Nova Scotia               125,949      137,865      -11,916            54,455       62,880     1.16
New Brunswick              96,607      109,955      -13,348            41,460       49,965     1.21
Québec                  1,092,919    1,145,085      -52,166            61,310      203,035     3.31
Ontario                 1,572,504    1,725,225     -152,721           250,570      328,640     1.31
Manitoba                  143,391      165,105      -21,714            54,030       97,620     1.81
Saskatchewan              204,937      192,840       12,097            63,395       69,220     1.09
Alberta                   669,995      691,970      -21,975           336,830      139,180     0.41
British Columbia          689,253      785,825      -96,622           234,545    [illegible]   0.53

¹ A migrant is a person who at the time of the previous Census was living outside Canada, in a different province
or in a different municipality (or CSD). RRC mobility data used here are those given by the RRC sample person
in the Census and not those derived within the RRC (based upon a comparison of addresses).

² Statistics Canada 1983a.


Discrepancies between the RRC estimate of enumerated persons and the Census count have 
also occurred for earlier Censuses. The value of the RRC estimate minus the Census count was
289,000 for 1971, and -324,000 for 1976. For both of these Censuses, the RRC estimates of 
persons deceased and emigrated/abroad were consistent with other sources. The large change 
from 1971 to 1976, coincident with the large negative values for two consecutive Censuses, 
cannot emanate from a single source. Changes in the size of overcoverage, larger than the size 
of the discrepancies, would be required between Censuses. This by itself, however, would not 
be consistent with the results of demographic analysis for these three Censuses (Statistics Canada 
1987). 

Remaining consistent with the demographic estimates, the differences would be explained 
in part by the presence of a large downward bias in the 1971 RRC estimate of persons missed. 
The 1971 unbiased estimate would have to be of the order of 3.8% rather than the estimated 
1.9%. This would have to be accompanied by a smaller decrease in overcoverage between
1966 and 1971 followed by an increase in overcoverage for 1976 and a decrease for 1981. There 
would have to be also some under-estimation of missed persons for 1976. 

Such a scenario is speculative, however, and no reason was found for such changes occur- 
ring. Other scenarios may also be possible. The occurrence of the discrepancies, however, does 
raise questions about the reliability of the RRC estimates and the potential effect of over- 
coverage on net coverage error. 

The provincial distribution of the discrepancy between the RRC estimate of enumerated
persons and the Census count differs among Censuses, further confounding its effects and
potential sources. These results for the 1976 Census are given in Table 5.

Table 5

Difference Between Reverse Record Check Estimates of Persons
Enumerated and the 1976 Census Counts

                          Difference in
                          Population            Percent
Province                  Enumerated            Difference
                          (RRC-1976 Census)

Canada (10 provinces)     -323,500              -1.4
Newfoundland                21,900               3.9
Prince Edward Island          -500              -0.4
Nova Scotia                 -4,500              -0.5
New Brunswick              -15,000              -2.3
Québec                     -56,200              -0.9
Ontario                   -207,000              -2.5
Manitoba                    -6,600              -0.6
Saskatchewan                 1,400               0.1
Alberta                    -43,400              -2.4
British Columbia           -12,800              -0.5


3.2 Variation in Geographic Distributions 


The RRC estimates of undercoverage can be used as general indicators of the coverage quality 
of the Census. They are also intended to be used to direct the development and testing of cov- 
erage improvement procedures for future Censuses. Under ideal circumstances, they would 
be used to model undercoverage to produce estimates for small areas and as part of a coverage 
adjustment ‘‘correction’’ procedure. For these uses, geographic variation in coverage quality, 
indicated by the RRC results, is of particular concern. Variation in Census data distributions
has been examined to determine whether it is correlated with the apparent variation in under-
coverage among provinces. To date these investigations have not yielded satisfactory models
or explanations. 

A lack of success modelling undercoverage or explaining the variation between provinces 
may be due to, or confounded by: (i) bias and/or sampling error in the RRC estimates; (ii)
undercoverage not strongly correlated to the Census characteristics of individuals, households 
and/or families; (iii) undercoverage correlated to a perhaps complex combination of Census 
and other characteristics; and/or (iv) a multitude of sources of undercoverage that must be 
considered separately; for example, undercoverage of individuals considered separately from 
undercoverage of entire households. 


4. CONCLUSION 


The RRC is thought to be the best vehicle developed to date for estimation of undercoverage 
in the Census in Canada. Its estimates provide basic measures to monitor and assess the quality 
of Census counts. 

There are conceptual, theoretical and practical limitations to the RRC method as
currently applied to the Canadian Census. The frames or lists used, while covering the large 
majority of the population to be enumerated, are not comprehensive. Specific geographic areas
are excluded as are certain segments of the population. The sample size is limited, but not 
necessarily to its present size, by constraints of tracing and matching, and by the demands for 
accuracy in operations. The ‘‘not traced’’ cases are a source of bias. The proportion of cases 
not traced, relative to the proportion of ‘‘missed’’ cases, in particular, adds an important uncer- 
tainty to the estimates, as does the inconsistency of RRC estimates of enumerated persons with 
corresponding Census counts. 

In some instances the degree or impact of error, or limitations, could be evaluated in greater 
depth. Modifications and alternative procedures or methods that have a reasonable likelihood 
of improving the quality and applicability of the estimates can be applied. Potentially, alter- 
natives can be developed. Such changes, however, would have varying costs and degrees of 
effectiveness associated with them. Also, it remains to be shown whether such changes would 
do more than enhance the status of the RRC estimates as general indicators of coverage quality 
in the Census. 

Appendix 1 
Further Results From the Reverse Record Check 


Results from the 1986 Census Reverse Record Check have been published (Statistics Canada 
1988). The following extract displays the undercoverage rates for the 1981 and 1986 Censuses 
for demographic characteristics. Analysis of the 1986 undercoverage estimates by province, 
age, sex, marital status, mother tongue and other groupings is continuing. 


1981 and 1986 Reverse Record Check Undercoverage Rates for Selected 
Population Characteristics - 10 Provinces 


                              1981 Estimated Population     1986 Estimated Population
                              Undercoverage                 Undercoverage
Characteristic                Rate (%)       S.E. (%)       Rate (%)       S.E. (%)

Sex
  Male                        [illegible]    0.13           3.91           0.16
  Female                      1.65           0.12           [illegible]    0.16
Age Group
  0- 4                        [illegible]    0.22           2.28           0.48
  5-14                        [illegible]    0.21           [illegible]    0.26
  15-19                       2.96           [illegible]    3.89           0.60
  20-24                       [illegible]    0.29           9.06           0.45
  25-34                       [illegible]    0.28           4.76           0.32
  35-44                       2.20           0.26           2.40           0.32
  45-54                       0.81           0.23           [illegible]    0.28
  55-64                       0.91           0.29           2.09           0.31
Marital Status
  Married/Separated           [illegible]    [illegible]    1.89           0.15
  Divorced                    [illegible]    1.03           7.07           1.07
  Widowed                     0.64           0.39           2.68           0.51
  Single/Never Married        2.86           0.16           4.91           0.21
Mother Tongue
  English                     1.86           0.11           3.12           0.13
  French                      1.80           0.20           3.0            0.33
  Other                       3.08           0.26           -              -
Urban/Rural Population
Size Group
  Urban Areas                 2.08           0.11           3.28           0.13
    500,000 & over            2.29           0.17           3.58           0.15
    100,000 to 499,999        1.86           0.31           2.94           0.33
    Less than 100,000         1.80           0.23           -              -
  Rural Areas                 1.79           0.21           [illegible]    0.29

Appendix 2 


Equations Used to Assess RRC Estimates and Estimator 


The 1981 Reverse Record Check estimates have been assessed and discussed based upon four 
equations. The first simply defines the RRC population or frames. The second redefines the 
RRC sample in terms of the outcome or estimates of the study. The third defines the popula- 
tion enumerated in the Census in terms of the RRC estimate of enumerated persons. The fourth 
defines the error components for the estimate of missed persons. 


Equation 1: 


\[ \text{The RRC population size} = C_{76} + M_{76} - e(\hat{M}_{76}) + I_{76/81} + B_{76/81}, \]


where 

$C_{76}$ = number of persons counted, or enumerated, in one of the ten provinces in the 
1976 Census, 

$M_{76}$ = number of persons missed in one of the ten provinces in the 1976 Census, 

$e(\hat{M}_{76})$ = error (under or (-) over estimation of persons) associated with $\hat{M}_{76}$, the Missed 
frame sample; i.e., $M_{76} = \hat{M}_{76} + e(\hat{M}_{76})$, 

$I_{76/81}$ = number of registered 1976 to 1981 intercensal immigrants to one of the ten 
provinces, 

$B_{76/81}$ = number of registered 1976 to 1981 intercensal births in one of the ten provinces. 


Equation 2: 


\[ \text{The RRC estimates} = \hat{C}_{81} + \hat{C}_{T81} + \hat{M}_{81} + \hat{M}_{T81} + \hat{E}_{76/81} + \hat{A}_{81} + \hat{D}_{76/81} + \hat{O}_{R81}, \]


where 

$\hat{C}_{81}$ = estimated number of persons in an RRC frame who were enumerated in one of 
the ten provinces in the 1981 Census, 

$\hat{C}_{T81}$ = estimated number of persons in an RRC frame who were enumerated in one of 
two territories in the 1981 Census, 

$\hat{M}_{81}$ = estimated number of persons in an RRC frame who were missed in one of the 
ten provinces in the 1981 Census, 

$\hat{M}_{T81}$ = estimated number of persons in an RRC frame who were missed in one of the 
two territories in the 1981 Census, 

$\hat{E}_{76/81}$ = estimated number of persons in an RRC frame who were 1976 to 1981 intercensal 
emigrants, 

$\hat{A}_{81}$ = estimated number of persons in an RRC frame who were abroad and had no 
usual place of residence in Canada at the time of the 1981 Census, 

$\hat{D}_{76/81}$ = estimated number of persons in an RRC frame who died in the 1976 to 1981 
intercensal period, 

$\hat{O}_{R81}$ = estimated overcoverage (number of ‘‘persons’’) in the Census, Birth and 
Immigrant frames which was detectable in the 1981 RRC operations. 

Equation 3: 


Burgess: Evaluation of Reverse Record Check Estimates 


\[ \text{The estimate } \hat{C}_{81} \text{ should} = C_{81} - C_{81}[e(\hat{M}_{76})] - C_{c/re\,81} - R_{81} - T_{76/81} - S_{76/81} + M_{N81} - O_{81} + C(O_{NR81}), \]


where 

$C_{81}$ = number of persons enumerated in one of the ten provinces in the 1981 Census, 

$C_{81}[e(\hat{M}_{76})]$ = that component of $e(\hat{M}_{76})$ not, or (-) over, represented in $\hat{C}_{81}$, 

$C_{c/re\,81}$ = under or (-) over-estimation of ‘‘enumerated’’ persons in an RRC frame 
because of classification, response, sampling and ‘‘no trace’’ error in the 1981 
RRC, 

$R_{81}$ = number of persons abroad at the time of the 1976 Census who were in Canada 
at the time of the 1981 Census, 

$T_{76/81}$ = number of intercensal migrants from the two territories to a province, 

$S_{76/81}$ = net number of intercensal entries to the ten provinces, as of Census Day, not 
in an RRC frame and not accounted for above (e.g., illegal aliens), 

$M_{N81}$ = number of persons not in an RRC frame who were missed in one of the ten 
provinces in the 1981 Census, 

$O_{81}$ = overcoverage in the ten provinces in the 1981 Census, 

$C(O_{NR81})$ = estimated overcoverage (number of ‘‘persons’’) in the Census, Birth and 
Immigrant frames which was not detected in the 1981 RRC operations and 
is represented in $\hat{C}_{81}$. 


Thus, 

\[ \hat{C}_{81} - C_{81} = - C_{81}[e(\hat{M}_{76})] - C_{c/re\,81} - S_{76/81} + M_{N81} - O_{81} + C(O_{NR81}) - R_{81} - T_{76/81}, \]

assuming no error in $\hat{O}_{R81}$. 


Equation 4: 

\[ M_{81} - \hat{M}_{81} = M_{81}[e(\hat{M}_{76})] - M_{81}(O_{NR81}) + M_{c/re\,81} + M_{N81} = e(\hat{M}_{81}), \]


where 

$M_{c/re\,81}$ = under or (-) over-estimation of ‘‘missed’’ persons in an RRC frame because 
of classification, response, sampling and ‘‘no trace’’ error in the 1981 RRC, 

$M_{81}[e(\hat{M}_{76})]$ = that component of $e(\hat{M}_{76})$ represented in $\hat{M}_{81}$, 

$M_{81}(O_{NR81})$ = estimated overcoverage (number of ‘‘persons’’) in the Census, Birth and 
Immigrant frames which was not detected in the 1981 RRC operations and 
is represented in $\hat{M}_{81}$. 


Note: There is a classification, response, sampling and ‘‘no trace’’ error component associated 
with each item of equation 2; e.g., $C_{c/re\,81}$ and $M_{c/re\,81}$. These taken in total sum to zero. 
In the above equations these error components exclude error caused by overcoverage 
and overcoverage which results in a ‘‘not traced’’; e.g., non-existent persons enumerated 
in the previous Census. The effect of overcoverage is included, for example, in 
$C(O_{NR81})$ and $M_{81}(O_{NR81})$. 


Similarly, 


\[ e(\hat{M}_{76}) = M_{76}[e(\hat{M}_{71})] - M_{76}(O_{NR76}) + M_{c/re\,76} + M_{N76}. \]


Survey Methodology, December 1988 155 


Error and part of the difference $\hat{C} - C$ can be passed from one RRC to another through the 
Missed frame and through overcoverage in the Census frame. This error could account for a 
large part of the difference $\hat{C}_{81} - C_{81}$. The effect on $\hat{C} - C$ may be much greater than on $\hat{M}$. 


The rate of net coverage error in the 1981 Census, for the ten provinces, would be equal to: 


\[ \frac{M_{81} + M_{N81} - O_{81}}{C_{81} + M_{81} + M_{N81} - O_{81}} \]


and the rate of undercoverage would be: 


\[ \frac{M_{81} + M_{N81}}{C_{81} + M_{81} + M_{N81} - O_{81}}. \]


The estimator used in the RRC is 

\[ \frac{\hat{M}_{81}}{\hat{C}_{81} + \hat{M}_{81}}. \]


Even a relatively small value of $M_{N81} - O_{81}$ could contribute significant bias to the results 
of the RRC, if these results are used as estimates of net coverage error. A relatively small value 
of $e(\hat{M}_{81})$ could contribute significant bias to the RRC undercoverage estimates: two potential 
elements of bias coming from the previous RRC; one from any misclassification within 
the RRC; and one from ‘‘missed’’ persons among those not included in an RRC frame. There 
may be, of course, some cancellation among these elements. 
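
The three rates defined above can be compared numerically. The following is a minimal sketch, assuming error-free inputs (hats dropped) and purely hypothetical counts; the function name `coverage_rates` and all figures are illustrative and do not come from the RRC. It shows how a non-zero number of missed persons outside the frames, or a non-zero overcoverage, separates the RRC-style ratio from both the net coverage error rate and the undercoverage rate.

```python
def coverage_rates(c, m, m_n, o):
    """Return (net coverage error rate, undercoverage rate, RRC-style rate).

    c   : persons enumerated in the census (this count includes overcoverage o)
    m   : persons in an RRC frame who were missed
    m_n : persons not in any RRC frame who were missed
    o   : overcoverage in the census
    """
    true_population = c + m + m_n - o
    net_error = (m + m_n - o) / true_population
    undercoverage = (m + m_n) / true_population
    # RRC-style estimator: ignores m_n and o entirely
    rrc_style = m / (c + m)
    return net_error, undercoverage, rrc_style


# Hypothetical counts, chosen only for illustration
net, under, rrc = coverage_rates(c=24_000_000, m=480_000, m_n=20_000, o=100_000)
print(f"net coverage error rate: {net:.4%}")
print(f"undercoverage rate:      {under:.4%}")
print(f"RRC-style rate:          {rrc:.4%}")
```

With these illustrative inputs the RRC-style rate falls between the other two, so it would overstate net coverage error and understate undercoverage; other inputs can change the ordering.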

An alternative estimator would be to use $C_{81}$ instead of $\hat{C}_{81}$ in the denominator. There are 
specific and not unlikely circumstances under which the use of $C_{81}$ would produce estimates 
with less bias at the national level. These circumstances, which involve the relative sizes of 
$\hat{C}_{81} - C_{81}$, $O_{81}$ and $M_{N81}$, do not hold, however, for provinces or estimates for which the 
Census count of enumerated is less than the RRC estimate of enumerated. 

ABSTRACT 


A significant increase in coverage error in the 1986 Census is revealed by both the Reverse Record Check 
and the demographic method presented in this paper. Considerable attention is paid to an evaluation 
of the various components of population growth, especially interprovincial migration. The paper con- 
cludes with an overview of two alternative methods for generating postcensal estimates: the currently- 
in-use, census-based model, and a flexible model using all relevant data in combination with the census. 


KEY WORDS: Census undercoverage; Population estimates; Demographic component method. 


1. INTRODUCTION 


The accuracy of the census, and of the postcensal population estimates based thereon, is 
an important issue in its own right. The use of population numbers in the formulae for 
calculating revenue transfers between various levels of government, makes the question of 
accuracy all the more critical and politically sensitive (Fellegi 1980; Romaniuc and Raby 1980). 
The intense debates on whether or not to adjust population counts for census undercoverage 
in Canada and the USA, and several judicial litigations fought in the latter country in recent 
years, are indications of both the political importance and the technical complexity of the issue. 

Yet, in spite of all that has been written on the subject, the elaborate arguments marshalled 
by both those for and those against adjustment, the debates remain inconclusive (Keyfitz 1979 
and 1981; Kish 1980; Spencer 1980; Freedman and Navidi 1986; Stoto 1987). Eventually 
Statistics Canada decided (as did the US Department of Commerce) against adjustment for 
census undercoverage, while at the same time reaffirming its long-standing commitment to the 
policy of data quality evaluation (Wilk 1981). By making public both the evaluation results 
and the underlying methodology, the users can make adjustments to suit their particular needs, 
in full knowledge of the strengths and limitations of the census counts and estimates. It is in 
the spirit of this policy on quality evaluation that this paper has been written. 

There are basically two approaches to the evaluation of the accuracy of census counts. One 
is the ‘‘micro’’ approach, involving individual verification, case-by-case record matching, in 
order to identify persons who have been missed, enumerated more than once, or enumerated 
even though, by definition, they are not part of the census universe. To this type of evaluation 
belong the US Bureau of the Census Post-Enumeration Program and Statistics Canada’s 
Reverse Record Check (RRC). 

The second is the ‘‘macro’’ evaluation approach involving an analysis at aggregate levels, 
such as comparison of the census counts with figures derived from independent sources or with 
estimates arrived at by means of statistical and demographic methods. Following the pioneering 
work by Ansley Coale (1955), the demographic techniques of analysis have been used by the 
US Bureau of the Census to evaluate census coverage concurrently with the Post-Enumeration 
Program (see most recent report by Fay et al. 1988). Some earlier attempts of this kind in 
Canada were also made (Lapierre 1970). The essence of the demographic method, as we shall 
see later, is that it brings to bear the formal relationship between population and its growth 
components - namely births, deaths and migration. 

The evaluation of the 1986 Census coverage through the Reverse Record Check (RRC) has 
been carried out and reported upon elsewhere (Carter 1988; Statistics Canada 1988). It suf- 
fices to say that the RRC-based estimates of undercoverage are subject to sampling error - 
which can be quite significant for provinces with a small population - and to biases of unknown 
magnitudes (difficulties in tracing persons or matching individual records). Furthermore, the 
RRC has been designed primarily to measure undercoverage. The measurement of overcoverage 
has been attempted on an experimental basis, but at the time of writing, the results were 
unavailable. For these and similar reasons, an alternative assessment of the accuracy of the 
census counts becomes all the more important. 

This paper evaluates, by means of demographic analysis, the accuracy of the three most 
recent censuses, with emphasis on the 1986 Census. A three-step operation is followed. First, 
census counts and population estimates are compared with each other. Second, demographic 
techniques are used to generate alternative estimates of census undercoverage which are, in 
turn, compared with those based on the Reverse Record Check. As a third and final step, the 
focus of evaluation is shifted from census counts to intercensal change in population. Two 
sets of independent estimates of intercensal population change are produced. One is based on 
the two consecutive censuses, while the other is obtained directly from data on births, deaths 
and migration. 

Before proceeding with the actual evaluation, a word of caution is in order. Though of accep- 
table quality for most of the uses they serve, neither census counts nor population estimates 
are perfect. Indeed, there is no one set of data deemed to be perfect enough to serve as a 
benchmark for the validation of other data. The statistical reality is that data are imperfect in 
varying degrees. The fine tuning and high precision that would be required for particular uses 
- such as government allocations and revenue transfers referred to earlier - might not be 
attainable under the present state of the art. However, we hope that this evaluation, using a 
combination of statistical tools, imperfect as they may be, will enable us to get some sense of 
the direction and magnitude of errors and biases affecting census population counts and various 
components of population estimates. Such an undertaking will hopefully set the stage for 
improvements as we work toward the 1991 Census and the post-1991 population estimation 
methodology. 


2. CENSUS COUNTS VERSUS POPULATION 
ESTIMATES: ERROR OF CLOSURE 


The postcensal estimates of population are obtained, as per equation 1, by the so-called com- 
ponent method, whereby births and immigrants are added to, and deaths and emigrants are 
subtracted from, the base census population. The net interprovincial migration is then added 
to estimate population by province. The procedure is repeated annually over the five-year period 
to the next census. The current estimation methodology calls for postcensal estimates to be 
revised retrospectively so as to bring them in line with the latest census counts (Statistics Canada 
1987). The difference, as per equation 2, between estimates thus arrived at and census counts 
is termed ‘‘the error of closure’’ (EC). 


\[ P_t = R_{t-5} + B_{t-5,t} - D_{t-5,t} + I_{t-5,t} - E_{t-5,t} + N_{t-5,t} \qquad (1) \]

\[ EC(\%) = \frac{P_t - R_t}{R_t} \times 100, \qquad (2) \]

where: 
$P_t$ = estimated population at time $t$; 
$R$ = census counts at time $t$ or $t-5$ as the case may be; 
$B$ = number of births; 
$D$ = number of deaths; 
$I$ = number of immigrants; 
$E$ = number of emigrants as estimated; 
$N$ = net interprovincial migration as estimated; 


$t-5,t$ indicates the five-year period during which the events occurred. 
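
As an illustration of equations (1) and (2), the sketch below computes a postcensal estimate by the component method and the resulting error of closure. The function names and all input figures are hypothetical; they are not taken from Statistics Canada data.

```python
def postcensal_estimate(base_census, births, deaths, immigrants, emigrants,
                        net_interprovincial=0):
    """Equation (1): component-method estimate of the population at time t."""
    return (base_census + births - deaths
            + immigrants - emigrants + net_interprovincial)


def error_of_closure_pct(estimate, census_count):
    """Equation (2): error of closure as a percentage of the census count."""
    return (estimate - census_count) / census_count * 100


# Hypothetical province over one five-year intercensal period
p_t = postcensal_estimate(
    base_census=1_000_000,
    births=80_000,
    deaths=35_000,
    immigrants=30_000,
    emigrants=15_000,
    net_interprovincial=-5_000,
)
ec = error_of_closure_pct(p_t, census_count=1_045_000)
print(p_t)            # estimated population at time t
print(round(ec, 2))   # positive: the estimate exceeds the census count
```

A positive error of closure means the estimate exceeds the census count; a negative one, as reported for 1981, means the estimate falls short of it.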


Table 1 presents the error of closure for the last four censuses for Canada, provinces and 
territories. On the whole, agreement between the census counts and the population estimates 
is fairly good even for provinces. This is all the more remarkable considering the fact that, 
in the absence of direct records, both emigration from Canada and interprovincial migration 
have to be estimated from administrative data (family allowance and income tax files). 

Despite the high level of agreement, there are two salient features in the error of closure. 
One such feature is the jump to nearly one percent error of closure in 1986, a relatively large 
error when compared to that in the previous censuses. For the 1971 and 1976 censuses the error 
stood at slightly over one-half of one percent and only at one-quarter of one percent in 1981. 
The other feature is the negative error of closure in 1981. Whereas in the other three censuses, 
the estimates exceeded the census counts, in 1981 the former fell short of the latter. Almost 
all of this shortfall originated in the province of Alberta. 

Turning to the provinces, one notes a consistently positive error of closure in 1986, whereas 
the sign of the error varied in the previous three censuses. Furthermore, for most of the 
provinces, the magnitude of the error has increased in 1986 as compared to the previous 
three censuses. The larger errors of closure were found in the Maritime Provinces and Quebec, 
and the smaller in Ontario and in the Western Provinces, with the exception of Saskatchewan. 

The 1981 case of Alberta, referred to above, calls for some further remarks. In 1981, this 
province had to contend with an unusually large negative error of closure: the estimates fell 
short of the census count by 53,886 individuals or 2.41%. There are two possible explanations 
for this outcome. One is that the 1981 Census in this province may have suffered from a 

Table 1 


Error of Closure: Canada, Provinces and Territories, 
June 1971, 1976, 1981 and 1986 


                                          Percent Error (1)
Geographic Area            1971           1976           1981           1986

Canada                     0.51           0.58           -0.25          0.95
Newfoundland               [illegible]    -0.19          [illegible]    [illegible]
Prince Edward Island       -0.76          1.58           -0.31          1.06
Nova Scotia                -2.45          0.93           -0.03          1.28
New Brunswick              -0.44          [illegible]    -0.28          [illegible]
Quebec                     0.08           0.10           -0.58          1.34
Ontario                    1.41           1.07           0.37           0.73
Manitoba                   -0.01          [illegible]    0.83           [illegible]
Saskatchewan               0.21           0.91           -0.52          1.06
Alberta                    0.31           -0.09          -2.41          0.81
British Columbia           0.47           0.07           -0.22          0.58
Yukon                      -6.63          -2.34          -2.11          -4.66
Northwest Territories      3.14           -0.92          -5.60          -1.32


(1) (Population Estimate - Census Count) / Census Count x 100 


Source: Demography Division, Statistics Canada. 


relatively large ‘‘overcount’’. Prompted by the booming oil-based economy, a great number 
of transient job-seekers from other provinces made their way to Alberta, some of whom may 
have been incorrectly enumerated as this province’s usual residents. Yet, the fact that for 1981 
Alberta showed an above-average undercount (2.54%) only adds to the puzzle. The other 
possible explanation is that the flow of in-migrants to Alberta, in those days of its economic 
prosperity and demographic boom, was not fully captured by the family allowance and taxa- 
tion files - the basis of interprovincial migration estimates. In other words the large shortfall 
in the 1981 estimates of population might have resulted from an understatement of the net 
migration to Alberta. 

Having demonstrated that the gap between estimates and counts widened significantly in 
1986, the question to be addressed in the subsequent sections is whether this is due to the 
deterioration of: (a) the census coverage or (b) the data on the components of population growth 
over the last intercensal period. 


3. DEMOGRAPHICALLY-DERIVED UNDERCOVERAGE RATE 


By adjusting the census base population for undercoverage as estimated from the RRC, and 
by adding the net population increase (births, deaths and migrants) over the subsequent 
postcensal period, one obtains, as per equation 3, the population at the time of the next census. 
We shall call this the expected population, to differentiate it from the estimated and enumerated 
populations dealt with in the previous section. 


\[ P'_t = [R_{t-5} + \hat{U}_{t-5}] + G_{t-5,t}, \qquad (3) \]

where: 
$P'_t$ = expected population at time $t$; 
$R_{t-5}$ = enumerated population at time $t-5$; 
$\hat{U}_{t-5}$ = the number of individuals missed in the census $t-5$, as estimated through 
the Reverse Record Check (RRC); 
$G_{t-5,t}$ = estimates of net population change over the intercensal period $t-5,t$ 
(births, deaths and migrants in equation (1)). 


The difference, $U'_t$, between the expected population, $P'_t$, and the enumerated population, 
$R_t$, as per equation 4, can be taken here as a coverage error. We shall call this the demographic 
estimate of coverage error. 


\[ U'_t = P'_t - R_t. \qquad (4) \]


And the rate of coverage error, $u'_t$, is simply the ratio of the demographically estimated 
error of coverage, $U'_t$, to the expected population, $P'_t$: 


\[ u'_t = \frac{P'_t - R_t}{P'_t} = \frac{U'_t}{P'_t}. \qquad (5) \]


For comparison, the undercoverage rate as estimated through the RRC stands as follows: 


\[ \hat{u}_t = \frac{\hat{U}_t}{R_t + \hat{U}_t}. \qquad (6) \]
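
The chain from the expected population to the two rates can be sketched in a few lines. This is only an illustration under assumed inputs: the function names and the counts below are hypothetical, and the RRC-estimated numbers of missed persons are simply supplied as given values.

```python
def demographic_coverage(r_prev, u_prev, g, r_now):
    """Return (expected population, coverage error, coverage error rate).

    r_prev : enumerated population at the base census (t-5)
    u_prev : persons missed at the base census, as estimated by the RRC
    g      : net population change over the intercensal period
    r_now  : enumerated population at the terminal census (t)
    """
    expected = r_prev + u_prev + g   # equation (3)
    error = expected - r_now         # equation (4)
    rate = error / expected          # equation (5)
    return expected, error, rate


def rrc_rate(r_now, u_now):
    """Equation (6): RRC undercoverage rate at the terminal census."""
    return u_now / (r_now + u_now)


# Hypothetical national figures, for illustration only
expected, error, demo_rate = demographic_coverage(
    r_prev=24_000_000, u_prev=480_000, g=1_000_000, r_now=24_800_000
)
print(expected, error, f"{demo_rate:.2%}")
print(f"{rrc_rate(r_now=24_800_000, u_now=800_000):.2%}")
```

Note that the two rates use different denominators: the demographic rate divides by the expected population, while the RRC rate divides by the enumerated population plus the estimated number missed.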

How do the demographically estimated error of coverage and the RRC-estimated 
undercoverage compare? First, it should be stressed that both are subject to error and bias. The former 
is affected by: (a) the lack of an estimate of overcoverage; (b) the biases in the RRC-based 
undercoverage $\hat{U}$ at the $t$ and $t-5$ censuses; and (c) the biases involved in the estimates of intercensal 
net population change $G_{t-5,t}$, particularly its migration component. The RRC estimate of 
undercoverage is affected by: (a) sampling error; and (b) various biases due to tracing of 
individuals, record matching, etc. Furthermore, the undercoverage rate, $\hat{u}_t$, as per formula (6), 
is slightly downwardly biased because $R_t$ in the denominator includes an overcount of 
unknown quantity. Hence, on these grounds alone, comparison between the two coverage 
measurements is far from being straightforward. 

But there are conceptual differences as well. The RRC estimate is a pure undercoverage 
measurement. Demographically estimated coverage error is a more complex entity, difficult to define 
unequivocally. It is neither an undercoverage nor a net undercoverage. In order to better 
grasp the relationship between the two, the equation (3) of the expected population, $P'_t$, may 
be rewritten as per (7). Note that the enumerated population, $R$, is now expressed in terms of 
its two components: those who were correctly enumerated, $R'$, and those who were 
overcounted, $O$. 


\[ P'_t = [(R'_{t-5} + O_{t-5}) + \hat{U}_{t-5}] + G_{t-5,t}. \qquad (7) \]

The undercoverage rate estimated by the demographic method as expressed in equation (5) 
now becomes: 


\[ u'_t = \frac{[(R'_{t-5} + O_{t-5}) + \hat{U}_{t-5} + G_{t-5,t}] - (R'_t + O_t)}{(R'_{t-5} + O_{t-5}) + \hat{U}_{t-5} + G_{t-5,t}}. \qquad (8) \]


It follows from (8) that the overcoverage affects both the expected and the enumerated 
populations. Consequently, the demographic rate of undercoverage reflects the combined effect 
of the undercoverage per se and the difference in the overcoverage, $O$, of the base census, $t-5$, 
and terminal census, $t$. Assuming that both (a) the RRC-based undercoverage, $\hat{U}$, at $t$ and $t-5$, 
and (b) the population change (the net sum of the components) for the intercensal period, $G_{t-5,t}$, 
are correctly estimated, then the demographic coverage rate, $u'_t$, and the RRC rate, $\hat{u}_t$, will 
vary numerically depending on the level of the overcoverage of censuses at time $t-5$ and $t$, 
so that if $O_t = O_{t-5}$ then $\hat{u}_t = u'_t$. 
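
The role of overcoverage in this identity can be checked numerically. The sketch below is an illustration under the stated assumptions only (undercoverage and intercensal change known without error); the function name and all counts are hypothetical. It evaluates the demographic rate and the RRC rate for equal and for unequal overcoverage at the two censuses.

```python
import math


def demographic_and_rrc_rates(r_true_prev, o_prev, u_prev, g, o_now, u_now):
    """Return (demographic rate, RRC rate) when U and G are error-free.

    r_true_prev : persons correctly enumerated at the base census (t-5)
    o_prev      : overcoverage at the base census
    u_prev      : persons missed at the base census
    g           : true net population change over the period
    o_now       : overcoverage at the terminal census (t)
    u_now       : persons missed at the terminal census
    """
    true_pop_now = r_true_prev + u_prev + g
    r_true_now = true_pop_now - u_now                # correctly enumerated at t
    expected = (r_true_prev + o_prev) + u_prev + g   # expected population
    enumerated_now = r_true_now + o_now              # census count at t
    demographic = (expected - enumerated_now) / expected
    rrc = u_now / (enumerated_now + u_now)
    return demographic, rrc


base = dict(r_true_prev=23_900_000, o_prev=100_000, u_prev=480_000,
            g=1_000_000, u_now=800_000)

# Equal overcoverage at both censuses: the two rates coincide
same_demo, same_rrc = demographic_and_rrc_rates(o_now=100_000, **base)

# Higher overcoverage at the terminal census: the demographic rate is lower
more_demo, more_rrc = demographic_and_rrc_rates(o_now=180_000, **base)

print(math.isclose(same_demo, same_rrc))
print(more_demo < more_rrc)
```

Under these assumptions, a demographic rate below the RRC rate corresponds to overcoverage that grew between the two censuses.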

Having clarified the conceptual particularities of the two measures of coverage error, we 
now turn to Table 2 which presents for Canada the coverage estimates for the 1981 and 1986 
censuses. Both estimates reveal a significant increase in the coverage error in the 1986 Census. 
However, the demographically-derived rate of coverage error is consistently lower than the 
RRC rate of undercoverage: 2.82% and 3.21% for 1986, and 1.70% and 2.01% for 1981, 
respectively. This could mean that the overcoverage was higher in 1981 than in 1976, and higher 
in 1986 than in 1981, on the condition that the assumptions underlying the identities are cor- 
rect. But there are no data to either confirm or deny the validity of these assumptions. 

The estimates of coverage error by the two methods - demographic and RRC - by province 
in Table 2 are portrayed by Figure 1(a) and 1(b). The explanation of the differences at the pro- 
vincial level is liable to present even greater uncertainties because the error and biases, 


Table 2 


Demographic and Reverse Record Check Estimates of Undercoverage Rates: 
By Provinces, 1981 and 1986 


                            Demographic Method        Reverse Record Check (1)
Geographic Area             1981         1986         1981                1986
                            (%)          (%)          (%)                 (%)

Canada (Territories
  not included)             1.70         2.82         2.01 (0.09)         3.21 (0.12)
Newfoundland                2.29         3.60         1.74 (0.95)         2.01 (0.32)
Prince Edward Island        0.05         2.10         [illegible] (0.54)  [illegible] (0.80)
Nova Scotia                 0.82         [illegible]  1.05 (0.34)         2.63 (0.38)
New Brunswick               1.83         3.28         1.81 (0.30)         2.83 (0.36)
Quebec                      2.31         [illegible]  1.91 (0.21)         3.06 (0.29)
Ontario                     1.81         2.53         1.94 (0.14)         3.40 (0.19)
Manitoba                    1.88         1.44         0.98 (0.35)         [illegible] (0.40)
Saskatchewan                0.76         2.00         0.99 (0.37)         [illegible] (0.36)
Alberta                     -1.18        3.09         2.54 (0.36)         [illegible] (0.33)
British Columbia            2.62         3.39         3.16 (0.33)         4.49 (0.39)


(1) Figures in brackets are Standard Deviations (%). 
Source: Demography Division, Statistics Canada. 

referred to above, at these levels are expected to be larger than they are at the national level. 
This is true in particular for sampling error in the case of the RRC undercoverage estimates, 
and for the biases in the interprovincial migration affecting net intercensal population change 
in the case of the demographic estimates of coverage error. 

With the above comments regarding the biases and conceptual differences in mind, let us 
see how consistent the two coverage measures are at the provincial level. To this end, the 
following criterion of consistency is posited: if the two measures of coverage were conceptually 
identical and empirically correct, their respective correlation points in space should line up along 
the 45° bisectrix. 

For the 1981 Census, disregarding the special case of Alberta referred to earlier (and also 
P.E.I. heavily affected by the sampling error), the correlation points follow closely the 
theoretical 45° straight line. The discrepancies are small: in most cases they are not statistically 
significant given the standard deviation affecting the RRC estimates (see Table 2). 
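
One rough way to make this comparison concrete is to express each provincial gap in units of the RRC standard deviation. The sketch below uses 1981 values as they appear in Table 2 for four provinces. It rests on two assumptions that are not stated in the paper: the demographic rate is treated as a fixed quantity with no sampling error of its own, and a two-sided threshold of 1.96 is used to flag a gap.

```python
# 1981 values from Table 2: (demographic rate %, RRC rate %, RRC std. dev. %)
table2_1981 = {
    "New Brunswick": (1.83, 1.81, 0.30),
    "Ontario": (1.81, 1.94, 0.14),
    "Saskatchewan": (0.76, 0.99, 0.37),
    "Alberta": (-1.18, 2.54, 0.36),
}

Z_CRITICAL = 1.96  # assumed threshold; the paper does not specify one


def standardized_gap(demographic, rrc, rrc_sd):
    """Gap between the two rates in units of the RRC standard deviation."""
    return (demographic - rrc) / rrc_sd


flagged = [
    province
    for province, (demo, rrc, sd) in table2_1981.items()
    if abs(standardized_gap(demo, rrc, sd)) > Z_CRITICAL
]

for province, (demo, rrc, sd) in table2_1981.items():
    print(f"{province:15s} {standardized_gap(demo, rrc, sd):7.2f}")
print(flagged)
```

Among these four provinces only Alberta is flagged, in line with its treatment as a special case in the text.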

For the 1986 Census, six provinces out of ten (Saskatchewan, Nova Scotia, Prince Edward 
Island, Quebec, Alberta and New Brunswick) have their respective points falling within close 
range of the 45° bisectrix and thus meet the consistency test. One, Newfoundland, falls far 
afield on the left side, suggesting a possible understatement of the RRC undercoverage rate 
for this province. Manitoba, Ontario and British Columbia fall well to the right side of the 
45° bisectrix suggesting a possible overstatement of the RRC undercoverage or understatement 
of demographic coverage rate. 

It should be stressed once again that the analysis of the accuracy of census coverage has 
been hampered by the lack of information on overcoverage. Yet, it is fair to say that not- 
withstanding its limitations, the analysis strongly points to a deterioration of the 1986 census 
coverage. 


4. CENSUS AND COMPONENT-BASED INTERCENSAL 
POPULATION CHANGE: A CHECK FOR CONSISTENCY 


The task now at hand is to compare two sets of independent estimates of the intercensal 
net population change: one set based on demographic components (births, deaths and migra- 
tion), the other set derived from two consecutive censuses, unadjusted and adjusted for under- 
coverage. Refer to the former as component-based estimates and to the latter as census-based 
estimates of intercensal net population change. 


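\[ G_{t-5,t} = B_{t-5,t} - D_{t-5,t} + I_{t-5,t} - E_{t-5,t} + N_{t-5,t} \qquad (9) \]

\[ G'_{t-5,t} = R_t - R_{t-5} \qquad (10) \]

\[ G''_{t-5,t} = (R_t + \hat{U}_t) - (R_{t-5} + \hat{U}_{t-5}). \qquad (11) \]

Equation (9) is the component-based estimate; equations (10) and (11) are the census-based 
estimates, unadjusted and adjusted for undercoverage respectively. 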


All the above notations have been made explicit in the previous formulae. 

Two independently-produced estimates might be construed as reasonably trustworthy if they 
are similar for a given point in time. As seen in Table 3, the difference between census-based 
and component-based estimates is only about 5% for the 1976-81 period. For the 1981-86 
period, the two estimates differ by a substantial margin of 19% if unadjusted, and by 8% if 
adjusted for undercoverage. 

Figure 1. Relationship between Undercoverage Rates as Estimated by Reverse Record Check and 
Demographic Method, 1986 Census 

Figure 2. Relationship between Undercoverage Rates as Estimated by Reverse Record Check and 
Demographic Method, 1981 Census 

Table 3 


Ratio Between Census and Component-Based Intercensal Change in Population: 
By Province, 1976-81 and 1981-86 

                              Ratio between Census and Component-based
                              Intercensal Population Change Multiplied by 100

                              1981-86                        1976-81
Geographic Area               Not adjusted   Adjusted for    Not adjusted   Adjusted for
                              for Census     Census          for Census     Census
                              Undercoverage  Undercoverage   Undercoverage  Undercoverage

Canada (Territories
  not included)               80.9           108.4           104.5          106.1
Newfoundland                  [illegible]    19.6            58.3           80.9
Prince Edward Island          76.7           101.6           109.8          [illegible]
Nova Scotia                   70.4           110.3           101.3          [illegible]
New Brunswick                 55.6           86.7            [illegible]    99.2
Quebec                        54.2           97.1            [illegible]    83.8
Ontario                       88.1           [illegible]     92.0           103.2
Manitoba                      89.2           [illegible]     35.6           29.0
Saskatchewan                  79.4           110.3           112.0          105.5
Alberta                       88.8           94.5            115.6          124.4
British Columbia              89.5           118.3           102.2          105.8


Note: The procedure cannot be applied for the period 1971-76 because, for this and earlier periods, emigration has 
been estimated residually from the two consecutive censuses and the remaining growth components (births, 
deaths and immigrants). 

Source: Demography Division, Statistics Canada. 


The comparison by province is a more delicate matter. On the components side, one has 
to contend with the reliability of the interprovincial migration estimates. On the census side, 
one must reckon with the variability of biases in undercoverage and overcoverage, and sampling 
errors in the RRC undercoverage estimates. Sampling errors alone could account for up to 
15% of variations in the ratio between the two estimates of the intercensal population change 
for some provinces. Any variations beyond this level are more likely to have been induced by 
errors and biases from sources other than sampling. 

Hence, in the absence of a more trustworthy criterion, we have set ±15% as a tolerance 
limit for the discrepancies between the two estimates. The tolerance limit thus set, has at least 
the merit of screening out highly questionable cases. 
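
The tolerance test can be sketched as follows. The ratios are the 1981-86 figures not adjusted for undercoverage, as they appear in Table 3; Newfoundland is left out because its entry in that column is not legible in the table (the text puts it at about 5%). The function names are illustrative only.

```python
TOLERANCE = 15.0  # tolerance limit, in percentage points around 100


def change_ratio(census_change, component_change):
    """Census-based change as a percentage of component-based change."""
    return census_change / component_change * 100


def within_tolerance(ratio, tolerance=TOLERANCE):
    """True if the ratio lies within 100 plus or minus the tolerance."""
    return abs(ratio - 100) <= tolerance


# Table 3, 1981-86, not adjusted for census undercoverage
unadjusted_1981_86 = {
    "Prince Edward Island": 76.7,
    "Nova Scotia": 70.4,
    "New Brunswick": 55.6,
    "Quebec": 54.2,
    "Ontario": 88.1,
    "Manitoba": 89.2,
    "Saskatchewan": 79.4,
    "Alberta": 88.8,
    "British Columbia": 89.5,
}

passing = [p for p, r in unadjusted_1981_86.items() if within_tolerance(r)]
print(passing)
```

On these figures four provinces pass the test, all of them west of Quebec.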

With these qualifications in mind, let’s turn to Table 3, which compares by province, census 
and component-based population changes for the last two intercensal periods. Six provinces 
out of ten for the 1976-81 period, and four out of ten for the 1981-86 period meet the some- 
what arbitrarily set tolerance test. In general, the discrepancies are wider for the 1981-86 period 
than for 1976-81. Particularly conspicuous in this regard are the provinces of Newfoundland, 
Quebec, and New Brunswick. 

Newfoundland’s census-based 1981-86 population change represents only 5% of that derived 
from the components. It is still only 19%, even after adjustment for undercoverage. Such a 
low population growth would call for a net migration loss of about 26,000 over the 5-year 
period. Yet, all the three sources of interprovincial migration (Family Allowance, Taxation 
and the census mobility question) place these losses in the range of 14,800 to 16,500 (see Table 5). 

Similar inconsistencies are found in the case of Quebec. The census-based population growth 
for the period 1981-86, which represents only 64% of the component-based growth, would 
imply Quebec’s loss through out-migration to be twice the amount estimated by Statistics 
Canada, that is, 160,000 instead of 80,000. Yet again, all the three sources of information put 
the net-migration losses in the range of 63,000 to 81,000 over the 5-year period. The gap between 
the two estimates of intercensal change is almost wiped out when the 1981 and 1986 census 
counts are adjusted for undercoverage. 

The case of New Brunswick is similar to that in Quebec and Newfoundland. The census- 
based estimate of population growth for the 1981-86 period suggests a net loss through out- 
migration of 11,200, whereas the family allowance-based figure is 2,200. The census mobility 
question and taxation figures are even lower, 1,376 and 65, respectively. Adjustment for under- 
coverage would bring New Brunswick’s two estimates of the intercensal population change 
well within the tolerance limit. 

What, then, can be concluded from the above analysis regarding the intercensal popula- 
tion change? It appears that both the components and the census generate reasonably consis- 
tent estimates of population change for the 1976-81 period. The discrepancies are small, within 
a tolerable limit for Canada and for most of the provinces. This, however, is not the case for 
the most recent intercensal period, 1981-86. Something seems to have deteriorated and the question 
remains as to whether it is the census or the components of population growth. As was 
seen in the preceding section, the 1986 Census experienced a significant increase in undercoverage, 
as estimated by two different methods. Adjustment for undercoverage, however, did 
not always produce better estimates of intercensal population growth; in fact, the opposite happened 
in some cases. In the next section, we take a closer look at the components of population 
growth. 


5. HOW GOOD ARE THE COMPONENTS OF 
POPULATION GROWTH? 


What follows is a brief assessment of the quality of the data on births, deaths, immigra- 
tion, emigration, and interprovincial migration. For a more complete account of the data on 
those components, and methodologies for estimating migration, the reader is referred to the 
1987 Statistics Canada publication ‘‘Population Estimation Methods, Canada’’. 

The registration of births and deaths is deemed to be complete in this country. Deaths or 
births that somehow escape registration must be by necessity very small in number in view of 
the prevailing regulations (need for a burial certificate) and the material (family allowance) 
incentives and legal requirements for registering births. Some late registration may occur, but 
the numbers are small. For the 1981-85 period, 3,831 or 0.02% of all births and 2,528, or 0.03%, 
of all deaths were registered beyond the cut-off date. This makes a net of only 1,303 persons 
unaccounted for in the population estimates. 

Immigration statistics are regarded as reasonably accurate to the extent one speaks here of 
landed immigrants. The distribution of immigrants by province is based on their intended 
destination rather than on where they actually settle. It is, however, noteworthy, as per Table 
4, that this distribution closely agrees with the 1986 Census distribution of immigrants. 

Compared to the three other components reviewed above - births, deaths and immigration 
- interprovincial migration and emigration are weaker links in equation (1) which is used for 
estimating population for postcensal years. There are indeed no direct records of internal migra- 
tion or emigration. Such figures must be estimated indirectly from administrative files 




Table 4 

Percentage Distribution of Immigrants by Province Based on the 1981 Census 
and Immigration Records of Intended Destination in 1980 

Geographic Area                Immigration Records     Census 
Newfoundland                                   0.4        0.3 
Prince Edward Island                           0.1        0.1 
Nova Scotia                               [illeg.]   [illeg.] 
New Brunswick                                  0.8        0.8 
Quebec                                    [illeg.]       15.0 
Ontario                                       43.5       42.7 
Manitoba                                       5.4        5.4 
Saskatchewan                                   2.5        2.6 
Alberta                                       13.2       14.5 
British Columbia + Yukon 
  + Northwest Territories                 [illeg.]       17.6 
Canada                                       100.0      100.0 

Source: Demography Division, Statistics Canada. 


- family allowance and income tax - which contain information on changes of residence. They 
deserve, therefore, more than a cursory consideration. In what follows, we shall focus on the 
significant methodological and data improvements achieved in recent years, as well as address 
certain persistent shortcomings inherent to these estimates. For a more complete account see 
Chapters IV and V of the Population Estimation Methods, Canada, 1987. 

While family allowance data have been used since 1956, the most significant innovation to 
the system for estimating interprovincial migration was the addition of personal income tax 
data in 1976. As of 1981, a ‘‘two-track’’ estimation system was implemented: the preliminary 
quarterly and annual estimates based on family allowance data, and the final annual estimates 
based on taxation data. Both these data sources have strengths and weaknesses. 

The main advantage of the family allowance file lies in its timeliness and fairly high accuracy. 
The information on change of address is available two months after the fact. The accuracy 
of the file is contingent upon two factors. The first is the comprehensiveness of coverage of 
child population, as every child under 18 years of age, supported by a parent, is entitled to 
a monthly payment. The second is the financial incentive for the beneficiaries of family 
allowances to report any change of address as soon as it occurs. The family allowance file does 
not, however, provide information on adult migration. This has to be estimated indirectly, by 
applying a conversion factor, ‘‘f’’, which is obtained by calculating the ratio of the adult migration 
rate to the child migration rate from the taxation data available for the most recent year. 
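
A minimal sketch of how such a conversion factor might be applied is given below. The paper does not state the estimation formulae in this section, so the second function is an assumption (adult migrants = f × child migration rate × adult population), and all names and figures are hypothetical.

```python
def conversion_factor(adult_migrants_tax, adult_pop_tax,
                      child_migrants_tax, child_pop_tax):
    """f = adult migration rate / child migration rate, both taken
    from taxation data for the most recent year."""
    adult_rate = adult_migrants_tax / adult_pop_tax
    child_rate = child_migrants_tax / child_pop_tax
    return adult_rate / child_rate


def estimate_adult_migrants(child_migrants_fa, child_pop, adult_pop, f):
    """Assumed use of f: scale the child migration rate observed in the
    family allowance file to an adult rate, then apply it to adults."""
    child_rate = child_migrants_fa / child_pop
    return f * child_rate * adult_pop


# Hypothetical figures for one province of origin.
f = conversion_factor(3_000, 200_000, 1_000, 100_000)    # about 1.5
adults = estimate_adult_migrants(1_200, 100_000, 200_000, f)
print(round(f, 3), round(adults))
```

Computing f separately for each origin-destination pair, as introduced in 1981, would amount to calling `conversion_factor` once per pair rather than once per province of origin.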

Given the key importance of the f factor in the estimation formulae, a few comments are 
called for. Prior to 1971, the value of f was based on 5-year migration data from the most recent 
census. As the annual age-specific data on migrants became available from income tax records, 
the decision was made to use such data since they have an advantage over census data in that 
they reflect a more recent age pattern of migration. 

Another innovation is worth mentioning. Prior to 1981, the f factor was calculated only 
by province of origin. However, with the availability of relevant data from taxation, it became 
evident that this factor also varies significantly by province of destination. Consequently, the 
decision was made to calculate the f factor by both province of origin and province of 
destination. 
Turning now to the personal income tax file as the data source for estimating interprovincial 
migration, the following assessment is in order. As compared to the family allowance file, 
the taxation file has the advantage of having a much broader demographic base: tax filers and 
their dependents represent roughly 90% of the population. However, there are various sources 
of potential errors and biases. Information on tax filers’ dependents must be imputed from 
the dollar value of total exemptions. Various assumptions have to be made in imputing the 
migratory status of the tax filers’ dependents, as well as that of persons who are neither filing 
income tax returns, nor are dependents upon those who do so, and therefore are not covered 
at all by the taxation system. This is particularly the case for young adults and the elderly, who 
may be more prone to neglect to file their tax-return or who may not earn the minimum income 
required for filing. Such differential age-related biases, if indeed present, affect the estimates 
of the age structure, and this in turn affects the value of the f factor, used in the family 
allowance-based preliminary estimates of interprovincial migration. 

Table 5 presents figures on net interprovincial migration for the intercensal 1981-86 period 
based on family allowance, taxation, and the census question on residence five years ago. Not- 
withstanding some significant variations in numbers, the three sources of data provide a con- 
sistent picture of level of interprovincial net migration over the 5-year period, by province. 

What has been said about interprovincial migration also holds for emigration - Canadians 
taking residence in another country. Prior to 1981, the aggregate emigration to countries other 
than the United States and the U.K. (for which data were available through the immigration 
services of the two countries) had to be estimated residually from consecutive censuses and 
the components of intercensal population growth. As of 1981, the estimation of the number 
of emigrants has been based on family allowance and income tax data. The procedure is similar 
to that described above for estimating interprovincial migration. Child-migration is estimated 
from family allowance data. To estimate adult emigration, and hence total emigration, a con- 
version factor, f, based on income tax data, is applied to child-emigration. This same pro- 
cedure applies to both the preliminary and final estimates of emigration, except that in the latter 
case more complete data are used. 


Table 5 

Net Interprovincial Migration for the Period 1981-1986, 
Based on Specified Sources 

Geographic Area           1986 Census¹    Family Allowance    Income Tax 
Canada                               0                   0             0 
Newfoundland                   -16,550             -14,837       -15,051 
Prince Edward Island             1,540                 293           751 
Nova Scotia                   [illeg.]               5,204         6,895 
New Brunswick                   -1,370              -2,239           -65 
Quebec                         -63,295             -76,040       -81,254 
Ontario                         99,355             115,497       121,762 
Manitoba                        -1,555              -3,700        -2,634 
Saskatchewan                    -2,820                -668        -2,974 
Alberta                        -27,665             -34,073       -31,676 
British Columbia                 9,500              13,289         7,382 
Yukon                           -2,665              -2,381        -2,775 
Northwest Territories             -755                -345          -366 

¹ Population of 5 years and over. 
Source: Demography Division, Statistics Canada. 
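
As a quick cross-check, the ranges of net migration losses quoted in the text for Newfoundland and Quebec can be recomputed from the Table 5 figures. The sketch below is illustrative only; the dictionary layout and function name are hypothetical.

```python
# Net interprovincial migration, 1981-86, by source (figures from Table 5).
net_migration = {
    "Newfoundland": {"census": -16_550, "family_allowance": -14_837,
                     "income_tax": -15_051},
    "Quebec": {"census": -63_295, "family_allowance": -76_040,
               "income_tax": -81_254},
}


def loss_range(by_source):
    """Return (smallest, largest) net loss across the three sources,
    expressed as positive numbers."""
    losses = [-value for value in by_source.values()]
    return min(losses), max(losses)


for province, by_source in net_migration.items():
    print(province, loss_range(by_source))
```

The results agree, after rounding, with the ranges of 14,800 to 16,500 and 63,000 to 81,000 cited earlier.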
Table 6 

Estimates of Emigrants by Different Methods, Canada, 1981-86 

Method                                                          1981-86 
Residual Method from Censuses 
  (a) Unadjusted for undercoverage                              476,406 
  (b) Adjusted for undercoverage                                134,807 
Revenue Canada Tax File                                        [illeg.] 
Family Allowance Method (current) 
  (using the f factor from the tax file)                        235,481 
Family Allowance Method (proposed) 
  (using the f factor from the immigration file)                275,762 
Reverse Record Check¹                                           288,376 

¹ Preliminary. 
Source: Demography Division, Statistics Canada. 


Table 6 compares, for the 1981-86 intercensal period, the estimates of emigration based on 
the family allowance files with the estimates produced by the various alternative methods. Note 
that the residually-derived emigration estimates, whether from adjusted or unadjusted census 
counts, are out of line with the more plausible estimates derived from the administrative files 
and the Reverse Record Check (RRC). 

In brief, significant enhancements have been made to the system used to estimate interprovin- 
cial migration and emigration, particularly since 1981. While it can be surmised that the overall 
quality of the estimates has improved as a result, no demonstrable proof can be adduced. The 
family allowance and income tax data are fraught with various shortcomings inherent in any 
data system that has been designed for administrative rather than for statistical purposes. 


6. CONCLUSIONS AND EMERGING ISSUES 


Statistics Canada’s population estimation system rests on two building blocks: (1) census 
population counts; and (2) components of population change, namely births, deaths and 
migrants. Postcensal estimates are carried forward by adding the components of population 
change over the subsequent years, to the base population, provided by the census. They are 
revised retrospectively when the next census counts become available. Thus, the census counts 
are both the base for the postcensal estimates, and the standard for their post-facto valida- 
tion. The system has produced timely, reliable and internally consistent population estimates, 
and over the years has enjoyed a remarkable stability. 
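
The carry-forward logic described above can be sketched as follows. This is an assumed rendering of the component method rather than the paper's equation (1), which is not reproduced in this section; the class, field names and figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Components:
    """Annual components of population change for one province."""
    births: int
    deaths: int
    immigrants: int
    emigrants: int
    net_interprovincial: int


def carry_forward(base_population, yearly_components):
    """Return postcensal estimates, one per year, starting from the
    census base and adding each year's components of change."""
    estimates = []
    population = base_population
    for c in yearly_components:
        population += (c.births - c.deaths
                       + c.immigrants - c.emigrants
                       + c.net_interprovincial)
        estimates.append(population)
    return estimates


# Hypothetical two-year example.
years = [
    Components(15_000, 8_000, 5_000, 2_000, -1_000),
    Components(14_000, 8_500, 6_000, 2_500, 500),
]
print(carry_forward(1_000_000, years))
```

Comparing the last carried-forward value with the next census count gives the retrospective validation described in the text.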

Much of its stability can be attributed to the high quality of the Canadian censuses. For 
Canada as a whole, undercoverage as measured by the Reverse Record Check (RRC) remained 
almost unchanged, at close to 2%, for three consecutive censuses — 1971, 1976 and 1981. Hence, 
even if the census fell somewhat short of the ‘‘true’’ population of Canada, it provided a highly 
reliable basis for gauging population growth. 

The 1986 Census marks, however, a departure from the trend, as the rate of undercoverage, 
estimated by the Reverse Record Check, rose to 3.2%. The 1986 Census understates the popula- 
tion increase over the 1981-86 period by about 20%, if one accepts the component method as 
the standard of validation. Both the Reverse Record Check and the demographic analysis cor- 
roborate the deterioration of census coverage in 1986. 

On the population components side of the equation - the other building blocks of the estima- 
tion system - records on births, deaths, and landed immigrants are fairly reliable. The inter- 
provincial migration and emigration estimates have benefited from various data and 
methodological enhancements, particularly since 1981, as was explained in the preceding section. 
But, as was also pointed out, they may suffer from various shortcomings inherent in any 
data sources - such as the family allowance and taxation files - that have been designed for 
administrative rather than statistical purposes. The estimates of interprovincial migration and 
emigration remain, along with census undercoverage and overcoverage, the prime sources of 
possible errors and biases in the postcensal estimates of population by province. 

What does the future hold for the estimation system as described above? Can it continue 
working as it stands, or does it need some major reconceptualization? The apparently higher 
undercoverage rates of the 1986 Census, and their potential consequences for population 
estimates, have prompted the discussion of an alternative to the present census-based method 
of producing estimates. This alternative would no longer necessarily rely on the most recent 
census as a bench-mark, but instead would use relevant available information, including census 
counts, undercoverage and overcoverage, as well as administrative records, to generate the 
‘‘best’’ possible estimates. In other words, the census counts remain an important ingredient 
of the estimation process, but not the overriding one; nor would the most recent census 
necessarily be used, if, say, the counts from the previous census were deemed to be more reliable. 

After careful consideration, Statistics Canada has decided that the 1986 Census (unadjusted 
for undercoverage) would be used for the 1986 postcensal estimates and revision of the estimates 
for the 1981-86 intercensal period. In other words, the existing estimation procedures were 
reconfirmed. But at the same time, it was recognized that the evaluation of the census and 
estimates needed to be stepped up, and that an estimation strategy for the post-1991 Census 
period needed to be devised. Such an estimation strategy would have to take into account plans 
and realistic prospects for improvements and enhancements in the following four areas: 


(1) 1991 Census coverage; 
(2) Measurement of both undercoverage and overcoverage; 


(3) Administrative records used for the purpose of population statistics: enhancement of the 
currently used sources - Family Allowance and Taxation - and the harnessing of new ones, 
such as Old Age Security and Provincial Health Care Files; 


(4) Estimates of migration, particularly those concerning interprovincial migration, returning 
Canadian residents after a protracted stay abroad, and emigration from Canada. 


These raise some fundamental issues concerning the philosophy and policy that ought to 
govern the working of a statistical system, thus transcending the rather narrow question of 
adjustment for undercoverage referred to at the outset of this paper. In the census-based con- 
ception, the emphasis is on the stability and internal coherence of the estimation system. In 
the conception of a census-divorced estimation model, a premium is placed on flexibility so 
as to increase the accuracy of the estimates through the utilization of the relevant available 
information, but possibly at the price of methodological consistency over time. The resolu- 
tion of the dilemma between these two conceptions will be greatly influenced by the progress 
that is achieved in the four areas of statistical endeavour identified above. 
Adjusting the 1986 Australian Census Count for 
Under-Enumeration 


C.Y. CHOI, D.G. STEEL and T.J. SKINNER¹ 


ABSTRACT 


In Australia, population estimates have been obtained from census counts, incorporating an adjustment 
for under-enumeration in 1976, 1981 and 1986. The adjustments are based on the results of a Post 
Enumeration Survey and demographic analysis. This paper describes the methods used and the results 
obtained in adjusting the 1986 census. The formal use of sex ratios as suggested by Wolter (1986) is exam- 
ined as a possible improvement of the less formal use made of these ratios in adjusting census counts. 


KEY WORDS: Census under-enumeration; Post-enumeration survey; Demographic estimates; Sex-ratios. 


1. INTRODUCTION 


The population census provides the basic information from which estimates are made of 
the population of the nation, each of the eight States and sub-State local government areas. 
In Australia, these population estimates are required for the determination of the number of 
seats each State will have in the Federal House of Representatives, the allocation of funds to 
each State, and the funding of local government authorities. Population estimates are also used 
in their own right as indicators of population growth and distribution and as denominators 
for various demographic, social and economic indicators. Because population estimates are 
used in such important ways, a high level of accuracy is required. 

In Australia, it is known that the level of under-enumeration at the census is significant and 
that this level is related to important variables such as birthplace, geographic area and age/sex. 
Because of this, an adjustment for under-enumeration is made to census counts used for popula- 
tion estimates. 

The adjustment of census counts for under-enumeration is a recent practice in Australia. 
Prior to the 1976 Census, census counts without adjustment for under-enumeration were used 
directly for population estimation purposes. The need to make this adjustment was recognised 
when the 1976 Census count fell considerably below the population estimates for the 1976 
Census date which were updated from the 1971 Census, and when the 1976 Post Enumeration 
Survey (PES) showed a high under-enumeration rate of 2.6 per cent compared with 0.5 per 
cent in 1966 and 1.3 per cent in 1971. The 1976 PES also showed significant variations in under- 
enumeration between States and Territories, ranging from 4.2 per cent for the Northern Ter- 
ritory to 1.1 per cent for Tasmania. In 1986, the level of under-enumeration is estimated to 
be 1.9 per cent. As in 1976, there were significant variations between States and Territories. 
The adjustment of 1976 and subsequent census counts has been well received and no challenges 
have been raised to the appropriateness of doing so or the accuracy of the methods used. This 
is in contrast with the high level of controversy experienced in the United States of America 
on the appropriateness of making adjustments to the 1980 census counts for under-enumeration. 


¹ C.Y. Choi, D.G. Steel and T.J. Skinner, Australian Bureau of Statistics, P.O. Box 10, Belconnen, ACT, 2616, 
Australia. 
Data for the assessment of the level of under-enumeration are primarily derived from a census 
PES. Results of the PES are assessed by comparing these with estimates based on demographic 
statistics and other independent data such as statistics on school enrolments, on children whose 
parents receive government family allowances, and on persons registered with the government 
Medicare insurance system. In Australia, school enrolments for children aged 6-15 years are 
compulsory and until means-testing was introduced in November 1987, family allowances had 
been universally paid to mothers of all children of ages less than 17. Medicare insurance is also 
compulsory and universal for all residents. These independent statistics are therefore helpful 
as a check of the PES results and demographic estimates. 

Although population estimates include an adjustment for under-enumeration, no adjust- 
ment is made for other census data. Census counts are published without adjustment. 


2. THE 1986 POST-ENUMERATION SURVEY 


In its five yearly population census, the Australian Bureau of Statistics (ABS) employs census 
collectors for the delivery of forms to each household and for the collection of completed forms 
from each household. The census is conducted on the basis of enumerating people where they 
are located on census night. 

This collector-based field system allows the census collection phase to be completed two 
weeks after the census date. This allows a census PES to be conducted reasonably close to the 
census date - in 1986 within 4-5 weeks of census night. Because the PES asks a number of ques- 
tions requiring detailed answers referring to a person’s location on census night, its conduct 
close to census date minimises recall error and also reduces the number of exclusions due to 
deaths and overseas travel. 

As the PES provides the basis for adjusting the census counts for under-enumeration, it 
is important that the PES be statistically independent of the census. The Appendix describes 
the steps taken to ensure independence. 

The basic approach adopted in the 1986 PES was to select a sample of people independently 
of the census through a multi-stage area sample of private dwellings. The information required 
of each person in the selected households was obtained by personal interview of any respon- 
sible adult by trained field staff from the ABS regular interview panel. Matching of PES and 
census records to determine whether each person in the sample should have been included in 
the census and how many times the person was in fact included was undertaken by clerical staff 
employed in the Census Data Transcription Centre. The procedures used are described in the 
Appendix. 

From the survey, the ratio of the number of persons who should have been included in the 
census (x) to the number of persons who were estimated to have been in fact included (y) can 
be estimated. This ratio is the net adjustment factor which accounts for both over and under- 
enumeration of individuals. 

This adjustment factor, after weighting, is then applied to the actual census count (Y) to 
produce an estimate of the population (X), i.e. X = Y (x/y). 
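
The adjustment X = Y (x/y) can be illustrated with a short Python sketch. It assumes, as the following paragraph of the paper indicates, that the factor is applied cell by cell and the results summed; the function names and all figures are hypothetical.

```python
def net_adjustment_factor(should_be_included, were_included):
    """x / y: weighted PES persons who should have been in the census,
    divided by weighted PES persons who were in fact included."""
    if were_included <= 0:
        raise ValueError("y must be positive")
    return should_be_included / were_included


def adjusted_population(cells):
    """cells: iterable of (Y, x, y) tuples, one per age/sex/area cell.
    Returns the sum over cells of Y * (x / y)."""
    return sum(Y * net_adjustment_factor(x, y) for Y, x, y in cells)


# Two hypothetical cells: census count, PES "should be", PES "were".
cells = [
    (500_000, 5_100.0, 5_000.0),   # factor 1.02
    (300_000, 3_090.0, 3_000.0),   # factor 1.03
]
print(round(adjusted_population(cells)))
```

Because x counts persons who should have been included and y counts inclusions actually made, a factor below 1 (net over-enumeration in a cell) is handled by the same expression.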

To allow for differences in expected and actual sample take in the PES, this procedure was 
applied at the age (5 year groups), sex and geographic area (capital city statistical division/rest 
of State) level. PES estimates are produced on both an actual location at the census date and 
usual residence basis. The estimation also includes an adjustment for the small level of non- 
contact and non-response in the PES. For example the estimate of usual residence population 
for geographic area (s) and age sex cell (a) is: 
In these estimation formulae the subscript c denotes the response status of the PES dwelling 
in the census and the subscript g denotes the geographic area in which the person was selected 
in the PES. D_gc is the number of responding dwellings and d_gc is the number of 
non-contact/non-responding dwellings in area g and census response category c. The sampling 
fraction varies between states and is denoted [symbol illegible]. 

In this form the estimator is a post-stratified ratio estimate. Ignoring for the moment that 
people may be enumerated in the census incorrectly or more than once, the estimator is the 
estimator obtained from a dual-record system or a capture-recapture approach discussed, for 
example, in Bishop, Fienberg and Holland (1975, pp. 231-234). This is shown in the diagram 
below where under the assumption of independence the estimate of the total population is Y 
(x/y) which is the ratio estimate X. 


[Diagram: two-way classification of persons by whether they were counted in the PES and in the Census.] 


The 1986 PES, however, was designed to collect information on both the number of persons 
missed by the census and the number of persons over-enumerated, i.e. included in the census 
erroneously or included more than once. The estimate X takes into account both over and under- 
enumeration at the same time. In this respect, the approach adopted is different from the tradi- 
tional capture-recapture methodology. 
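
For comparison, the traditional dual-record (capture-recapture) estimator referred to above can be sketched as follows. This is the textbook form under the independence assumption, ignoring over-enumeration; it is not the estimator actually used for the 1986 PES, and the names and counts are hypothetical.

```python
def dual_system_estimate(in_both, census_only, pes_only):
    """Estimate the total population from a two-way classification,
    assuming the census and the PES capture people independently.

    Returns (estimated_total, estimated_missed_by_both)."""
    if in_both <= 0:
        raise ValueError("need at least one person counted in both")
    census_total = in_both + census_only      # counted in the census
    pes_total = in_both + pes_only            # counted in the PES
    total = census_total * pes_total / in_both
    missed_by_both = census_only * pes_only / in_both
    return total, missed_by_both


total, missed = dual_system_estimate(in_both=900, census_only=100,
                                     pes_only=50)
print(round(total, 1), round(missed, 1))
```

Writing the census total as Y, the PES total as x and the matched count as y gives the same expression, Y (x/y), as the ratio estimate in the text.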

Variance estimation was based on treating X as a ratio estimate derived from a multi-stage 
sample. The relative standard errors on the PES estimates of the population are given in Table 
1. From this table and Tables 2 and 4 we see that standard errors are considerably less than 
the adjustments implied by the PES national age by sex estimates and State by sex estimates. 
Table 1 

1986 Census: Relative Standard Errors of PES Estimates 
of the Population 

Age            Males    Females    Persons 
                   %          %          % 
0-4             0.29       0.36       0.24 
5-9             0.29       0.30       0.22 
10-14           0.28       0.29       0.21 
15-19           0.32       0.32       0.24 
20-24           0.49       0.43       0.34 
25-29           0.49       0.36       0.32 
30-34           0.39       0.34       0.27 
35-39           0.36       0.30       0.24 
40-44           0.38       0.32       0.26 
45-49           0.37       0.30       0.25 
50-54           0.43       0.38       0.30 
55-59           0.38       0.30       0.25 
60-64           0.41       0.38       0.29 
65-69           0.43       0.37       0.29 
70-74           0.53       0.41       0.34 
75+             0.47       0.39       0.31 
All ages        0.12       0.10       0.08 

State          Males    Females    Persons 
                   %          %          % 
NSW         [illeg.]       0.18       0.14 
VIC             0.23       0.21       0.16 
QLD             0.27       0.24       0.19 
SA              0.27       0.20       0.17 
WA              0.29       0.25       0.19 
TAS             0.36       0.31       0.25 
NT              1.65   [illeg.]       1.22 
ACT             0.61       0.74   [illeg.] 


For a more detailed description of the 1986 Post-Enumeration Survey and the estimation 
procedures, see Appendix. 


3. DEMOGRAPHIC ESTIMATES OF CENSUS UNDER-ENUMERATION 


An alternative method for the estimation of census under-enumeration is through the use 
of past demographic data including those from previous censuses, births and deaths registers, 
and overseas migration statistics. For example, estimates of the population at a certain date 
can be made by updating a previous census using data on births, deaths and overseas migra- 
tion. The more distant is the previous census which serves as the base, the longer is the time 
series of reliable vital and migration statistics required, and the less reliance there needs to be 
on the accuracy of the census base. This is because estimates of persons born after the rele- 
vant census date will be affected only by the reliability of data on births, deaths and migra- 
tion. Internal migration data in Australia are not sufficiently reliable to enable the use of 
demographic methods for estimating census under-enumeration at sub-national levels. Use of 
demographic estimates for census evaluation is therefore limited to Australian totals. 
Australian data on births and deaths are available as a time series going back to the 19th 
century and it is unlikely that there have been significant omissions. Successive reports by the 
Australian Commonwealth Statistician after each population census from 1911 to 1961 claimed 
that the registration of births and deaths in Australia was substantially complete although it 
was recognised that some omissions were possible and that there were time lags in registra- 
tions. The Statistician’s Report was discontinued after 1961. However, there is no evidence 
that the level of coverage of birth and death registrations has deteriorated since then. 

Australia has also maintained comprehensive and reliable statistics on overseas arrivals and 
departures over a long period of time. These statistics cover all movements including perma- 
nent, long term and short-term movements. However, there are several deficiencies in the 
statistics on overseas arrivals and departures which limit their usefulness for the evaluation 
of the census data. First, there have been periods in the past when arrivals and departures were 
suspected of being inaccurately recorded (e.g. during World War II and the period immediately 
following the war). Second, because of the increase in overseas short-term movements since 
the 1960s only a sample (of about 1 in 20) of the arrival and departure records has been processed 
for statistical purposes since 1971. Third, errors can occur in the classification of 
travellers into permanent, long-term and short-term categories. To avoid these errors of 
classification, the comparison of demographic estimates, census counts and PES estimates of 
the population at census date is made on the basis of actual location, which includes all three 
categories of overseas movements. 

For the assessment of under-enumeration at the 1986 Census, demographic estimates of the 
population as at census date 1986 by age and sex were made using births, deaths and overseas 
migration data going back to 1921 together with results of the 1921 Census. Demographic 
estimates of the population to age 65 years are therefore based solely on births, deaths and migration 
data and would not be affected by the accuracy of the 1921 Census. 


4. VALIDATION OF THE 1986 PES ESTIMATES 


The following table shows the estimated population as at 30 June 1986 by age and sex based 
on demographic analysis and based on the 1986 PES. Medicare enrolments by age and sex are 
also shown. 

There is a very high level of correspondence between PES and demographic estimates of 
the male population, particularly for those aged under 30. However, there is a large discrepancy 
for males aged 30-34, the demographic estimates being 20,000 higher than PES estimates. This 
can be attributed to a large net gain in the number of males of these ages from short-term 
movements into and out of Australia in the period 1981-86. Net gains from short-term 
movements of this magnitude are not detectable in the adjacent age-groups and therefore may 
reflect some error in overseas arrivals and departures statistics. With the volume of overseas 
movements being very high (over 6 million in 1986), a small error in reporting of age or in processing 
can lead to a relatively large discrepancy in the demographic estimate in net absolute 
terms. The possibility of error in demographic estimates is further illustrated by the very high 
implied under-enumeration rate of 5.3 per cent for this age group compared with much lower 
rates for the surrounding age groups. 
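
The implied rate can be reproduced from the Table 2 figures for males aged 30-34 (census 615.5, PES 630.0 and demographic estimate 650.1, all in thousands). The sketch below assumes the rate is expressed relative to the estimated rather than the counted population, which is the convention that matches the published 5.3 per cent; the function name is hypothetical.

```python
def percent_under_enumeration(census_count, estimate):
    """Implied under-enumeration as a percentage of the estimated
    population: 100 * (estimate - census) / estimate."""
    return 100.0 * (estimate - census_count) / estimate


census = 615.5          # males 30-34, census count ('000)
pes = 630.0             # PES estimate ('000)
demographic = 650.1     # demographic estimate ('000)

print(round(percent_under_enumeration(census, demographic), 1))  # 5.3
print(round(percent_under_enumeration(census, pes), 1))          # 2.3
```

The gap between the two rates for the same age group is what prompts the paper to question the demographic estimate rather than the PES.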

It is, of course, quite likely that under-enumeration of overseas visitors was not adequately 
measured by the PES. However, in either case, errors in estimating the visitor component of 
the population should not affect the accuracy of official population estimates because these 
are based on the concept of usual residence and do not include visitors. 
Table 2 

Estimates of 1986 Population by Age and Sex Based on the 1986 
PES and Demographic Analysis, and Medicare Enrolment 

Males ('000) 

                      Population                   Difference from Census         Percent Under-enumeration 
Age       Census   PES(a)    DE(b)  Medicare(a)       PES        DE  Medicare       PES        DE  Medicare 
0-4        608.3    616.4    612.8        611.4       8.0       4.5       3.1  [illeg.]       0.7       0.5 
5-9        594.9    602.4    603.0        612.3  [illeg.]       8.1      17.4  [illeg.]  [illeg.]       2.8 
10-14      660.8    670.4    668.4        674.3       9.6       7.6  [illeg.]       1.4       1.1       2.0 
15-19      673.1    688.4    687.7        693.1      15.3      14.6      20.0       2.2  [illeg.]       2.9 
20-24      648.5    679.5    681.3        681.3      31.0      32.8      32.8       4.6       4.8       4.8 
25-29      649.2    677.7    675.0        688.5      28.5      25.8      39.3       4.2       3.8       5.7 
30-34      615.5    630.0    650.1        647.5      14.5      34.6      32.0  [illeg.]       5.3       4.9 
35-39      622.2    634.2    632.7        646.2      12.0      10.5      24.0       1.9  [illeg.]  [illeg.] 
40-44      504.2    512.6    517.0        522.3       8.4      12.8      18.1       1.6       2.5       3.5 
45-49      419.8    427.0    416.5        436.8  [illeg.]      -3.3      17.0  [illeg.]      -0.8       3.9 
50-54      363.7    371.2    371.4        377.9  [illeg.]  [illeg.]      14.2  [illeg.]  [illeg.]       3.8 
55-59      373.4    379.5    384.9        386.6       6.1  [illeg.]  [illeg.]       1.6       3.0       3.4 
60-64      341.1    347.0    348.1        350.6       5.9       7.0       9.5  [illeg.]       2.0  [illeg.] 
65-69      259.6    263.6    251.8        265.5       4.0      -7.8       5.9       1.5      -3.1       2.2 
70-74      204.2    208.2    200.8        213.0       4.0      -3.4       8.8       1.9      -1.7       4.1 
75+        229.5    233.0    181.5        250.1  [illeg.]     -48.0      20.6  [illeg.]     -26.4       8.2 
Total     7768.3   7941.0   7883.1       8057.3  [illeg.]     114.8     289.0       2.2  [illeg.]       3.6 

Females ('000) 

                      Population                   Difference from Census         Percent Under-enumeration 
Age       Census   PES(a)    DE(b)  Medicare(a)       PES        DE  Medicare       PES        DE  Medicare 
0-4        579.7    591.0    583.8        580.9  [illeg.]       4.1  [illeg.]       1.9       0.7       0.2 
5-9        565.1    572.4    565.5        582.1       7.3       0.4      17.0  [illeg.]       0.1       2.9 
10-14      628.0    636.8    630.2        641.8       8.8       2.2      13.8       1.4       0.3  [illeg.] 
15-19      644.1    657.4    651.4        666.3      13.3  [illeg.]  [illeg.]       2.0  [illeg.]       3.3 
20-24      633.1    652.5    644.4        670.4      19.4  [illeg.]  [illeg.]  [illeg.]       1.8       3.0 
25-29      648.7    660.7    665.4        684.1  [illeg.]      16.7      35.4       1.8  [illeg.]  [illeg.] 
30-34      618.1    627.8    631.2        643.9       9.7  [illeg.]      25.8       1.5       2.1       4.0 
35-39      612.1    619.1    600.2        626.3       7.0     -11.9      14.2  [illeg.]      -2.0  [illeg.] 
40-44      482.6    488.6    489.6        495.4       6.0       7.0      12.8  [illeg.]       1.4       2.6 
45-49      399.1    403.0    397.9        411.6       3.9      -1.2      12.5       1.0      -0.3       3.0 
50-54      349.1    354.6    343.9        358.6       5.5      -5.2       9.5       1.6      -1.5       2.6 
55-59      362.6    366.5    362.4        372.4       3.9      -0.2       9.8  [illeg.]      -0.1       2.6 
60-64      358.2    364.4    351.3        365.3       6.2      -6.9  [illeg.]  [illeg.]      -2.0       1.9 
65-69      298.2    302.2    301.9        306.7       4.0  [illeg.]       8.5       1.3  [illeg.]       2.8 
70-74      259.0    262.9    262.2        269.7       3.9       3.2      10.7       1.5  [illeg.]       4.0 
75+        396.2    404.7    385.0        434.1       8.5     -11.2      37.9  [illeg.]      -2.9       8.7 
Total     7833.8   7964.6   7866.2       8109.6     130.8  [illeg.]  [illeg.]       1.6       0.4       3.4 

(a) Actual location basis. 
(b) Demographic estimates based on 1921 Population Census and post 1921 demographic events. 

Figure 1. Percentage under-enumeration at the 1986 Census: Post-Enumeration Survey,
Demographic Estimates and Medicare Enrolment-MALES.

Figure 2. Percentage under-enumeration at the 1986 Census: Post-Enumeration Survey,
Demographic Estimates and Medicare Enrolment-FEMALES.

For females, the level of correspondence between PES results and demographic estimates
for ages below 35 is satisfactory. However, the demographic estimates for some age groups 
are considerably lower than PES estimates, and for those aged 35 to 39, and 45 to 64, they 
are lower than the unadjusted census count. Demographic estimates for these groups appear 
to be too low. This supports the view that demographic estimates are not sufficiently accurate 
for the production of population estimates and should be used only to assess PES results. 

PES under-enumeration rates by age show a pattern which is smooth and much less erratic 
than that shown by demographic estimates. The higher PES rates for young adults aged 20-29 
compared with those for other ages are as expected, given the higher rates of mobility among 
young adults, particularly males. 

Medicare registrations are considerably higher than PES estimates and demographic
estimates, except for the 0-4 age group. Studies of registration practice show that the lower
number in the 0-4 age group for Medicare registration reflects delays in births being registered
with Medicare, and the higher numbers at other ages reflect delays in deleting from the
Medicare register deaths and persons who have emigrated from the country.

Comparisons of PES estimates with estimates from family allowance registration and school 
enrolments for selected age-groups also show satisfactory correspondence. These results give 
some confirmation of the accuracy of the PES estimates in so far as the younger ages are 
concerned. 

Although there is a satisfactory level of correspondence between PES estimates and other 
estimates of the population, there are two remaining problems which require consideration 
before the PES estimates can be accepted. The first emerges from an analysis of the PES 
estimates of census under-enumeration rates by age and sex. These rates are shown in Table 2. 

Except for those aged 0-4 and 75+, male under-enumeration rates are generally higher than
female rates. While the rates for those aged 75+ could be affected by small sample size, the
rate for females aged 0-4 appears too high: 1.9 per cent compared with 1.3 per cent for males
of the same age and for females aged 5-9. The number of females aged 0-4 estimated by the
PES to have been under-enumerated was 11,300 compared with about 7,000 for the age group
5-9. This large difference in under-enumeration between those aged 0-4 and those aged 5-9 for
females does not exist for males.

The PES sex ratio for persons aged 0-4 is 104.3 males to 100 females, lower than the census
count ratio of 104.9 and the ratio of 105.0 males to 100 females estimated from demographic 
data. 
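
The rates and sex ratios quoted above can be recomputed from the counts for ages 0-4 in Table 2. The Python sketch below is an illustration only: the function names are hypothetical, and the female census count is taken as 579.7 thousand, the value implied by the PES estimate of 591.0 thousand less the 11,300 persons estimated as under-enumerated.

```python
def under_enumeration_rate(estimate, census):
    """Per cent of the estimated population that the census missed."""
    return 100.0 * (estimate - census) / estimate


def sex_ratio(males, females):
    """Males per 100 females."""
    return 100.0 * males / females


# Ages 0-4, thousands of persons (Table 2).
census_m, pes_m = 608.3, 616.4
census_f, pes_f = 579.7, 591.0  # 579.7 = 591.0 - 11.3 (assumption, see text above)

rate_m = under_enumeration_rate(pes_m, census_m)
rate_f = under_enumeration_rate(pes_f, census_f)
pes_ratio = sex_ratio(pes_m, pes_f)
census_ratio = sex_ratio(census_m, census_f)

print(f"under-enumeration: males {rate_m:.1f}%, females {rate_f:.1f}%")
print(f"sex ratio: PES {pes_ratio:.1f}, census {census_ratio:.1f}")
```

The female rate exceeds the male rate, and the PES sex ratio falls below the census ratio, which is the pattern that prompts the concern about females aged 0-4.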

On the above evidence, it appears that the PES has over-estimated females aged 0-4, although 
it is difficult to see how the PES could have over-estimated this group more so than other groups. 

The second problem relates to the very high PES under-enumeration rate estimated for the
Northern Territory. As shown in Table 4, it is 9.97% on an actual location basis and 6.45%
on a usual residence basis. The Northern Territory is a sparsely populated area (the census count
in 1986 was 154,800 in an area of 1.3 million square kilometers) with a highly mobile population.
The PES estimate of the population of the Northern Territory is considerably higher than
that based on the 1981 Census. Comparisons of PES estimates for the Northern Territory with
independent estimates, such as the number of children on the family allowance register and the
number of school enrolments, also show that PES estimates are high. While these independent
estimates may very well contain errors, it appears very likely that the PES has over-estimated
the rate of under-enumeration for the NT.

The PES questionnaires were checked for the Northern Territory and were found to be 
satisfactory except for one collection district where problems with unreliable addresses and 
difficult terrain exposed inadequacies in field procedures and led to difficulties with matching. 

Table 3
Comparison of 1986 PES Results with Independent Estimates


Persons (’000)

Age          PES          Demographic    Family         School
             Estimates    Estimates      Allowance      Enrolment

0- 4         1207.3       1196.6         1204.8(a)      -
5- 9         1174.8       1168.5         1177.0         -(b)
10-14        1307.2       1298.6         1304.2         1289.6


(a) Family allowance registration for age 0 is understated because of the time lag in births being
registered for family allowance. An adjustment was made by substituting the family allowance
figure for age 0 by an estimate from the demographic analysis.

(b) School enrolment not compulsory for children aged 5 years.


Table 4
PES Under-Enumeration Rates (%) by State


                             Actual          Usual
                             location        residence
                             basis           basis

New South Wales              1.54            [illegible]
Victoria                     1.59            [illegible]
Queensland                   2.68            2.43
South Australia              1.54            1.59
Western Australia            2.32            2.26
Tasmania                     [illegible]     1.16
Northern Territory           9.97            6.45
Aust. Capital Territory      1.95            1.61
Australia                    1.91            1.84


A judgement was made that the PES over-estimation of females aged 0-4 and of the NT 
population should be corrected by adjusting the PES results. The adjustment to females aged 
0-4 was made by using the sex ratio from demographic estimates and applying this to the PES 
estimates of males aged 0-4. Essentially, this amounted to replacing the PES estimate of females 
aged 0-4 by a better estimate using the PES estimate of males and the sex ratio. The result of
this adjustment was to reduce the estimate for this group by 4,000 to 587,000.
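
As a check on the arithmetic, the adjustment can be reproduced from the unadjusted PES figures in Table 2 and the demographic sex ratio of 105.0 males per 100 females quoted earlier. This is a minimal sketch with hypothetical names, not the ABS procedure itself.

```python
def females_from_sex_ratio(males, males_per_100_females):
    """Female estimate implied by a male estimate and a sex ratio."""
    return 100.0 * males / males_per_100_females


pes_males_0_4 = 616.4          # thousands, unadjusted PES estimate
pes_females_0_4 = 591.0        # thousands, unadjusted PES estimate
demographic_sex_ratio = 105.0  # males per 100 females

adjusted_females = females_from_sex_ratio(pes_males_0_4, demographic_sex_ratio)
reduction = pes_females_0_4 - adjusted_females

print(f"adjusted females 0-4: {adjusted_females:.0f} thousand")
print(f"reduction: {reduction:.1f} thousand")
```

The result agrees with the reduction of about 4,000, to 587,000, reported above.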

The problem with the NT estimates was handled by not using data from the problematic 
collection district. This reduced the Northern Territory under-enumeration rate to 9.1 per cent 
(on an actual location basis) and 5.5 per cent (on a usual residence basis). 

The two adjustments to PES results reduced the overall national under-enumeration rate 
from 1.91 per cent to 1.87 per cent (on an actual location basis), or from 1.84 per cent to 1.81 
per cent (on a usual residence basis). Table 5 shows PES estimates by age and sex after the 
above adjustments were made to the estimates for NT and for females aged 0-4. 

Table 5
Census Count 1986 Adjusted for Under-enumeration by Age and Sex


On the basis of ‘actual location’

                    Males                      Females                      Persons
Age           No.       % under-         No.       % under-          No.       % under-
            (’000)    enumeration      (’000)    enumeration       (’000)    enumeration

0- 4         616.3       1.30           586.6    [illegible]       1202.9       1.24
5- 9         602.4       1.24           572.4       1.27           1174.8       1.26
10-14        670.1       1.39           636.8       1.38           1306.9       1.39
15-19        688.3       2.19           657.3    [illegible]       1345.6    [illegible]
20-24        679.4       4.54           652.4       2.95           1331.8       3.76
25-29        677.5       4.17           660.7       1.81           1338.2       3.00
30-34        629.9       2.29           627.8       1.55         [1257.7]       1.92
35-39        634.0       1.87           618.9    [illegible]       1252.9       1.49
40-44        512.6       1.64           488.5    [illegible]       1001.1       1.43
45-49        426.9       1.66           403.0       0.98            829.9       1.33
50-54      [371.2]       2.04           354.6       1.56            725.8       1.80
55-59        379.5       1.62           366.5       1.06            746.0       1.34
60-64        347.0       1.70           364.4       1.70            711.4       1.70
65-69        263.6       1.52           302.3       1.35            565.9       1.43
70-74        208.2       1.92           262.9       1.47            471.1       1.67
75+          233.0       1.49           404.7       2.08          [637.7]       1.86

All ages    7940.1       2.16          7959.7       1.58          15899.8       1.87


On the basis of ‘usual’ residence

                    Males                      Females                      Persons
Age           No.       % under-         No.       % under-          No.       % under-
            (’000)    enumeration      (’000)    enumeration       (’000)    enumeration

0- 4         615.3       1.29           585.9       1.22           1201.2       1.26
5- 9         601.3       1.25         [571.2]       1.22           1172.5       1.23
10-14        668.5       1.29           635.7       1.36           1304.2       1.33
15-19        685.6    [illegible]       654.3       1.97           1339.9       2.04
20-24        673.1       4.33           646.9       2.83           1320.0       3.59
25-29        672.6       4.02           657.2       1.80           1329.8       2.92
30-34        626.6       2.21           625.6       1.53           1252.2       1.87
35-39        630.9       1.78           616.7       1.05           1247.6       1.41
40-44        510.3       1.59           487.0       1.19            997.3       1.39
45-49        424.7    [illegible]       401.7       0.98            826.4       1.26
50-54        369.6       1.97           353.0    [illegible]      [722.6]       1.75
55-59      [377.8]    [illegible]       364.0       0.92            741.8       1.22
60-64        345.6       1.74           361.6       1.61            707.3       1.67
65-69        262.1       1.47           300.2       1.34            562.3       1.38
70-74        207.2       1.89           261.3       1.46            468.5       1.65
75+          232.4    [illegible]       403.3       2.01            635.7       1.83

All ages    7903.6       2.08        [7925.5]       1.54          15829.1       1.81


[ ] Entry illegible in the source; a bracketed number is recomputed from the male, female and
persons counts in the same row, and [illegible] marks an entry that could not be recovered.

5. ESTIMATING SUB-NATIONAL POPULATIONS


Internal migration data are not sufficiently reliable for demographic estimates of the popula- 
tion at sub-national levels to be used to assess census under-enumeration. However, a com- 
parison of the 1986 PES estimates of the number of children aged 1-15 was made with the 
corresponding number receiving family allowance by State/Territory. This comparison shows 
a general agreement except for Northern Territory where the percentage difference was more 
than 2%. 

Given this general agreement between PES estimates and family allowance data, and in the 
absence of reliable independent data on higher ages for comparison with PES estimates, the 
PES estimates (after adjustments) of the State and Territory populations were accepted. 

Population estimates at the State/Territory level by age and sex, and at the local govern- 
ment area level were not derived directly from the PES. The 1986 PES was a sample survey 
and the results are subject to sampling error. Sampling errors at the State/Territory level by 
age and sex and at the local government area level are high, many unacceptably high, relative 
to the amounts of adjustment for under-enumeration which need to be made. An alternative 
indirect method, using an iterative proportional fitting (IPF) procedure, was used to produce 
State/Territory estimates by age and sex from those higher level PES estimates with a low 
sampling error. For a description of the IPF procedure, see Purcell and Kish (1979). This 
procedure involved taking the national population estimates by age and sex and the State/ 
Territory estimates within each sex and adjusting the census age by State/Territory counts to 
these two margins. 

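The IPF procedure involves the following cycles, n = 0, 1, ...:

$$X^{(2n+1)}_{gas} = X^{(2n)}_{gas} \, \frac{X_{\cdot as}}{X^{(2n)}_{\cdot as}}$$

$$X^{(2n+2)}_{gas} = X^{(2n+1)}_{gas} \, \frac{X_{g \cdot s}}{X^{(2n+1)}_{g \cdot s}}$$

where $X_{\cdot as}$ is the national PES estimate for age category a and sex s, $X_{g \cdot s}$ is the
PES estimate for state g and sex s, a dot denotes summation over the subscript it replaces,
and $X^{(0)}_{gas} = Y_{gas}$, the census count for state g, age category a and sex s. The procedure
converges to a unique solution. The use of IPF procedures, of course, assumes that the relationship
between the variables within the association structure is valid and that this relationship
is preserved.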
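
The two-step cycle can be sketched in a few lines of Python. This is a simplified illustration for a single sex, so the table has only a state dimension and an age dimension; the function name `ipf`, the starting counts and the target margins are all invented for the example and are not ABS figures.

```python
def ipf(table, row_targets, col_targets, tol=1e-9, max_cycles=1000):
    """Rake a two-way table of positive counts to the given margins.

    table[g][a]    : starting counts (census counts for state g, age a)
    row_targets[g] : required total for each state
    col_targets[a] : required total for each age group
    The two sets of targets must add to the same grand total.
    """
    x = [row[:] for row in table]
    n_rows, n_cols = len(x), len(x[0])
    for _ in range(max_cycles):
        # Odd step: scale every column to its age-group target.
        for a in range(n_cols):
            col_sum = sum(x[g][a] for g in range(n_rows))
            for g in range(n_rows):
                x[g][a] *= col_targets[a] / col_sum
        # Even step: scale every row to its state target.
        for g in range(n_rows):
            factor = row_targets[g] / sum(x[g])
            x[g] = [value * factor for value in x[g]]
        # Rows now match exactly; stop once the columns match as well.
        worst = max(
            abs(sum(x[g][a] for g in range(n_rows)) - col_targets[a])
            for a in range(n_cols)
        )
        if worst < tol:
            break
    return x


# Hypothetical census counts: 2 states by 3 age groups.
census = [[100.0, 150.0, 50.0],
          [200.0, 250.0, 150.0]]
state_totals = [306.0, 612.0]        # hypothetical adjusted state totals
age_totals = [303.0, 412.0, 203.0]   # hypothetical adjusted age totals

fitted = ipf(census, state_totals, age_totals)
for row in fitted:
    print([round(value, 1) for value in row])
```

Because each step only multiplies whole rows or whole columns by a constant, the cross-product ratios of the starting table are unchanged in the fitted table, which is the sense in which the association structure of the census counts is preserved.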

For estimates for local government areas, the problem with high sampling error is more acute 
and results of the PES are not sufficiently reliable to make direct estimates of under- 
enumeration for each local government area. Based on the premise that under-enumeration 
is age/sex and birthplace (Australian born/Overseas born) selective, and that it differs between 
States/Territories and between capital city and the rest of the State, adjustments for under- 
enumeration at the local government area level were made to reflect under-enumeration dif- 
ferentials by age, sex, capital city/rest of State and Australian-born/overseas-born. 


6. PROBLEMS WITH THE PES ESTIMATION 


As pointed out by Bailar (1985), for example, the bias and consistency of the PES estimates
are affected by errors in the matching process, any correlation between a person being missed
in the census and in the PES, and erroneous inclusions in either the census or the PES. It is
because of the possible effects of these factors that the results of the PES are assessed using
demographic and administrative data in the ways described above.

Errors in matching will bias the PES estimates. Failure to match records that in fact should
match will lead to the creation of apparently under-enumerated persons, and the PES estimate
will be an over-estimate. The effect of false matches will be the reverse.

Erroneous inclusions in either the census or PES will inflate the values of Y or x and hence
the PES estimate. The US Bureau of the Census conducts a special ‘‘E-sample’’ selected from
the census to estimate the extent of erroneous inclusions in the census, which can then be
incorporated in the estimate by adjusting the census count Y. For a description of the E-sample,
see Fay, Passel and Robinson (1988). The matching and estimation procedures used by the
ABS attempt to adjust for some of the effect of erroneous inclusions by determining not only
whether or not someone has been included but also whether they should have been included and
whether they have been included more than once. For example, in the 1986 PES, 250 people were
determined to have been included twice and four persons had been included three times. Cases were
also found where persons had been included but should not have been. In this way, viewing
the PES estimation as a ratio estimator rather than a dual system estimator makes it possible
to account for some erroneous inclusions.

The dual system estimation method makes the assumption that whether or not someone is
missed in the PES is independent of whether or not that person is missed in the census. Whilst
all practical steps have been taken to ensure that the two field and processing systems involved
in the collections are completely separate and independent, it is still possible for correlation
to exist. Positive correlation will mean that the PES estimate based on the assumption of
independence will be an under-estimate; negative correlation will lead to an over-estimate.
Negative correlation would occur if being included in the census led people to be hard
to enumerate in the PES, but we have no clear evidence for this; the final response rate for the
PES (95%) is in line with other household surveys conducted by the ABS. Positive correlation
seems more likely, and there appears to have been some evidence of this in the 1981 Census.
If such positive correlation exists, then the PES-based adjustments will not have gone far enough
but will have been in the right direction.
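
The direction of the correlation bias can be illustrated with a small deterministic calculation on expected counts. The sketch below is not the ABS estimator: the population size, the coverage probabilities and the function names are assumptions chosen for illustration. The estimator has the same ratio form as the estimation equation in the Appendix, with the census count in the role of Y, the PES count in the role of x and the matched count in the role of y.

```python
def dual_system_estimate(census_count, pes_count, matched):
    """Population estimate from two counts and the number found in both."""
    return census_count * pes_count / matched


def expected_dse(n, p_census, p_pes, p_both):
    """Dual system estimate evaluated at expected counts.

    p_both is the probability of being counted in both collections;
    independence corresponds to p_both == p_census * p_pes.
    """
    return dual_system_estimate(n * p_census, n * p_pes, n * p_both)


N = 100_000                   # hypothetical true population
P_CENSUS, P_PES = 0.98, 0.95  # hypothetical coverage probabilities

independent = expected_dse(N, P_CENSUS, P_PES, P_CENSUS * P_PES)
positive = expected_dse(N, P_CENSUS, P_PES, 0.9350)  # misses tend to coincide
negative = expected_dse(N, P_CENSUS, P_PES, 0.9305)  # misses tend to alternate

for label, value in [("independent", independent),
                     ("positive correlation", positive),
                     ("negative correlation", negative)]:
    print(f"{label:22s}{value:10.0f}")
```

Under independence the estimate recovers the true population; when the same people tend to be missed by both collections the estimate falls short of it, and when misses tend to alternate the estimate overshoots.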


7. ALTERNATIVE METHODS OF ESTIMATION (WOLTER 1986) 


The idea of combining PES data and demographically derived sex ratios or sex ratios 
obtained from other sources is the basis of methods suggested by Wolter (1986). Wolter sug- 
gests several models and associated methods which formally combine sex ratios and PES 
estimates. These methods are attempts to loosen the assumption of independence inherent in 
the PES estimation methods. 

Wolter considers two models. In the first it is assumed that the degree of association in under- 
enumeration between the PES and the census (as measured by the cross-product ratios in tables 
such as the diagram shown earlier in this paper) is the same for males and females within each 
age category. In the second model independence is assumed for females and an externally 
derived sex ratio is used to obtain the male figure. It is then possible to calculate the cross- 
product ratios implied for males. 
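
The arithmetic behind an implied cross-product ratio can be sketched as follows. This is an illustration of the idea only, not Wolter's estimator: the four cells are assumed to be "in both", "census only", "PES only" and "in neither", the total is assumed to come from an external source such as a sex ratio applied to the female estimate, and all counts are invented.

```python
def implied_cross_product_ratio(both, census_only, pes_only, total):
    """Cross-product ratio implied by an externally supplied total.

    The unobserved 'missed by both' cell is whatever remains once the
    three observed cells are subtracted from the external total.
    """
    neither = total - both - census_only - pes_only
    return both * neither / (census_only * pes_only)


# Hypothetical male counts for one age group.
both, census_only, pes_only = 9310, 490, 190

# A total of 10,000 is exactly what independence would imply.
theta_independent = implied_cross_product_ratio(both, census_only, pes_only, 10_000)
# A larger external total implies positive association.
theta_positive = implied_cross_product_ratio(both, census_only, pes_only, 10_020)
# A total below the number of persons actually observed gives a negative ratio.
theta_negative = implied_cross_product_ratio(both, census_only, pes_only, 9_985)

print(theta_independent, theta_positive, theta_negative)
```

A negative ratio arises whenever the external total is smaller than the number of distinct persons found in the two collections, which is one way such values can occur in practice.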

From an initial evaluation of these methods applied to Australian data, it was found that
the first model produced very erratic estimates of the cross-product ratios, with approximately
50% being negative. This was greatly reduced under the second model, although some remained
negative and were set to zero in a modified model. The problem with negative cross-product
ratios was also identified by Wolter (1986, p. 7). The second model, modified, was then applied
to 1986 data. For age groups 5-9 up to 35-39, the sex ratios obtained from the PES were in line
with expectations and those sex ratios were used, giving exactly the PES estimate. For the 0-4
age group the sex ratio obtained from demographic estimates was used and for the 40-44 to
75+ age groups, an alternative estimate of the sex ratios based on census counts was used.
The sex ratios are given in Table 6.


Table 6
Sex Ratios: Males per 100 Females


Age        Alternative       PES

0- 4          105.0         104.3
5- 9          105.2         105.2
10-14         105.2         105.3
15-19         104.7         104.7
20-24         104.1         104.1
25-29         102.6         102.6
30-34         100.3         100.3
35-39         102.4         102.4
40-44         104.5         104.9
45-49         105.2         106.0
50-54         104.2         104.7
55-59         103.0         103.5
60-64          95.2          95.2
65-69          87.1          87.2
70-74          78.8          79.2
75+            57.9          57.6


The sex ratios used and the PES sex ratios are not greatly different, so applying Wolter’s second
model leads to only small changes in the PES estimates. For the 0-4 and 75+ age groups the
estimates of males are increased by 0.7% and 0.5% respectively. For the 45-49 and 70-74 age
groups the estimates are reduced by between 0.7% and 0.5%. This analysis suggests that the
differences in biases between sexes in the PES estimation method due to the combined effect 
of the potential problems discussed above, are relatively small. It could be the case that any 
biases are affecting males and females to an approximately equal degree so that PES sex ratios 
are broadly acceptable. 
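
The size of these changes follows directly from the ratios in Table 6 if, as in the second model, the female estimate is held fixed and the male estimate is scaled to the alternative sex ratio. The sketch below uses a hypothetical function name, and the alternative ratio for ages 75+ is read as 57.9, the source entry being partly illegible.

```python
def percent_change_in_males(alternative_ratio, pes_ratio):
    """Change in the male estimate when females are held fixed and the
    sex ratio moves from the PES value to the alternative value."""
    return 100.0 * (alternative_ratio / pes_ratio - 1.0)


# (alternative, PES) sex ratios, males per 100 females, from Table 6.
table6 = {
    "0-4": (105.0, 104.3),
    "45-49": (105.2, 106.0),
    "70-74": (78.8, 79.2),
    "75+": (57.9, 57.6),  # 57.9 is an assumed reading of a damaged entry
}

changes = {
    age: round(percent_change_in_males(alt, pes), 1)
    for age, (alt, pes) in table6.items()
}
for age, change in changes.items():
    print(f"{age:6s}{change:+.1f}%")
```

The results for ages 0-4, 70-74 and 75+ match the quoted changes. For ages 45-49 the rounded ratios give a reduction of about 0.8%, slightly larger than the quoted 0.7%, presumably because the published ratios are themselves rounded.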

Our experience in 1981 and 1986 demonstrated the need to use sex ratios in assessing measures 
of under-enumeration and we believe the Wolter method is a useful way of generating alter- 
native estimates against which the Census count and direct PES estimates can be judged. The 
general acceptability of the PES sex ratios in 1986 has meant that using this method made little
difference. The acceptability of the PES sex ratios in 1986, except for the 0-4 age group,
contrasts with the experience in 1981, where an adjustment to the PES estimates was considered
necessary for a number of age groups based on alternative sex ratios. These differences in the
1981 and 1986 experience may reflect a reduction in correlation between under-enumeration 
in the census and the PES in 1986. 

8. CONCLUSION


While the ABS has adjusted the past three censuses for under-enumeration, our confidence 
in the basic reliability of the PES stems from its general consistency with other data sources. 
No fundamental change in approach is anticipated for the next census to be conducted in 1991. 
However, we believe there is a need to investigate further potential causes of bias, in particular 
the adequacy of the clerical matching procedures, and methods to overcome correlation bias. 
It is also planned to investigate the possibility of creating a demographic data bank on a usual 
residence basis, so that the effects of the large volume of short-term movements can be 
eliminated or reduced. 

APPENDIX


THE 1986 POST-ENUMERATION SURVEY 


General 


The 1986 PES was conducted in the 4th and 5th weeks after census night. The survey involved
interviews with a sample of the population from about 35,000 private dwellings (2/3 of one 
percent of dwellings) across Australia involving about 100,000 persons. The sampling frac- 
tion varied between States and Territories, with the smaller States and Territories having higher 
sampling fractions. Personal data on name, age, sex, marital status and birthplace were obtained 
by interviewers for matching with information on the census form. For each person in the 
survey, information was sought on their place of usual residence, where they spent census night, 
their address before and after census night and any other address where they might have been 
included on a census form. At each given address, the personal information was matched to 
census forms to establish whether a person was missed, counted once or the number of times 
counted if counted more than once. 


Scope and Sample Structure of the PES 

Except for the special cases mentioned below, the PES included in its scope all persons who 
should have been enumerated in the census, except those who had gone overseas or died between 
the census and PES dates. Diplomatic representatives and persons in diplomatic dwellings were 
not included in the census. These persons were excluded from the survey as were babies born 
after census night. Persons in the survey who were overseas on census night were matched to 
census forms to determine whether they were incorrectly included in the census. 

For practical reasons, very sparsely settled areas were not included in the PES. In these areas, 
special census procedures were used to contact and enumerate Aboriginal groups, people in 
mining camps, cattle stations, etc. The PES in these areas would need to rely on the same con- 
tacts and procedures adopted for the census and therefore could not accurately and 
independently measure under-enumeration. Consequently, the scope of the PES excluded these 
areas. 

Non-private or special dwellings such as hospitals, hotels, and motels also were not included
in the PES. The vast majority of residents in non-private or special dwellings would have been
short-term residents and, according to normal ABS survey rules, short-term residents would
have a chance of being included in the survey at their place of usual residence, where information
on such persons would be obtained. A relatively small number of long-term residents of
these dwellings were consequently not included in the PES. For estimation purposes, populations
out-of-scope were assumed to have the average capital or non-capital city rate of
under-enumeration for each State as appropriate and the average Territory rate for each of the two
Territories.

As non-private or special dwellings and sparsely settled areas contained less than 3% of the
total population, any differences in under-enumeration of these areas compared with areas
covered by the PES would be unlikely to have a significant effect on the overall estimated level
of under-enumeration at the State or national level.
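
The size of this effect can be bounded with a short calculation. The sketch below is illustrative only: the symbols $w$, $r_{\text{in}}$ and $r_{\text{out}}$ and the gap of 2 percentage points are assumptions introduced here, not quantities reported for the 1986 PES, and the overall rate is treated as a simple population-weighted average rather than being built up by State and capital city/rest of State.

```latex
% w     : share of the population outside the PES scope (w < 0.03)
% r_in  : under-enumeration rate in the areas covered by the PES
% r_out : true (unknown) rate in the out-of-scope population
\begin{align*}
r_{\text{true}}    &= (1 - w)\, r_{\text{in}} + w\, r_{\text{out}}, \\
r_{\text{assumed}} &= (1 - w)\, r_{\text{in}} + w\, r_{\text{in}} = r_{\text{in}}, \\
r_{\text{true}} - r_{\text{assumed}} &= w\, (r_{\text{out}} - r_{\text{in}}).
\end{align*}
% Hypothetical example: w = 0.03 and r_out - r_in = 2 percentage points
% give an error of 0.03 x 2 = 0.06 percentage points.
```

Under these assumptions, even a gap of 2 percentage points between the out-of-scope and in-scope rates would move the overall rate by less than 0.06 of a percentage point.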


Interaction Between the Census and the PES 

It is important that the PES be conducted as independently of the census as possible. Other- 
wise, the factors that led to a person being missed or overcounted in the census may also be 
present in the PES, resulting in biased estimation of the under-enumeration. Furthermore, 
knowledge of the areas to be included in the PES might influence the performance of census 
collectors in these areas so that the PES sample would not be a representative sample of the 
under-enumeration. For these reasons the field and office staff used in the census and PES 
were totally separate. PES interviewers were not employed as census collectors or census group 
leaders, and census field staff were not told which areas were included in the PES. 

Independence was further guaranteed in two ways - by ensuring the operational independ- 
ence of the field systems, and by adopting special procedures for census forms received by mail 
after the PES field work commenced. 

To ensure operational independence, PES field work commenced after all available census 
forms had been collected from the field. Thus census collectors were not in the field at the same 
time as PES interviewers and there was no possibility of interaction, even unintentional, between 
census and PES field staff. 

Special procedures for census forms received after the PES commenced were required to 
overcome the effects PES fieldwork may have had on householders who were late returning 
their census forms. In some cases, PES interviewers discovered census forms still uncollected. 
This situation was possible because some people had preferred to post in their census forms 
and had not yet done so, or the census collector had been unable to make contact to collect 
them. Some of these people who were included in the PES may have been prompted to post 
their forms in, where they would not otherwise have done so. To overcome this potential bias, 
any census form returned by mail after Monday 20 July 1986 (the day PES interviewing com- 
menced) was considered a late form. Special procedures for the treatment of late forms are 
described later in this Appendix. 


Matching procedures of the PES 

Matching for the purpose of determining whether a person was missed, counted once or 
the number of times counted if counted more than once, was conducted in two stages. Both 
these stages were clerical processes undertaken by staff at the census Data Transcription Centre. 

The first stage was the locating of census forms for the addresses of households selected
in the PES. Processing of 1986 Census forms was centralized in Sydney. Staff at the Population
Census Data Transcription Centre were requested to compare the address on the front
of the PES interview form with all addresses given in the record book of the census collector
who was responsible for the collection district (CD) in which the PES household was located.
The record book was used as a control in the delivery and collection of census forms, and
contained information such as name, address and number of persons for all households in the CD.

To assist identification of households where addresses were sometimes vague, for example
in rural areas, processing staff were asked to also use names of the householders, property names,
etc. In addition, staff were instructed to check through all addresses in the record book so that
any duplicate census forms were identified. Addresses in record books of adjacent CDs were
also checked if the address of the household selected by the PES was near the boundary of
the CD.

The second stage was person-matching and this was based on the name and demographic 
details of the persons listed on the census and PES forms. In this matching process, a search 
form was generated for each address reported in the PES for any person in the household, 
other than the address of the PES selected dwelling. A search form was treated the same way 
as a PES interview form and an attempt was made to locate the census form which corresponded 
to the search form address. 

In most cases, the person-matching procedure was straightforward. There were, however,
cases of spelling errors and insufficient details on addresses to identify a clear match on name. 
In these cases, a judgement on whether or not a person was counted was made based on other 
information such as age, sex, marital status, birthplace and relationship to other members in 
the census household. For doubtful cases, processing staff were required to consult their 
supervisor. 

The PES also asked the respondent whether each person was included on a census form. 
When matching failed because of lack of adequate information, the respondent’s statement 
about whether or not the person was counted was accepted. There were a few cases where even 
this information was unavailable. These cases were considered not counted in the census. 

After matching, the data was entered onto computer tapes, edited and reformatted to produce
a clean unit record file giving the number of times persons in the PES sample were counted
in the census.


Treatment of Late Census Forms and ‘Dummy’ Census Forms 
In forming the estimation equation:

$$X = Y \, (x / y)$$

where

X = estimated census count adjusted for under-enumeration,

Y = raw census count, unadjusted,

x = PES estimate of the number of persons who should have been included in the census, and

y = PES estimate of the number of persons who were included in the census,

two categories of census forms were treated as missed in the census. These are ‘dummy’ census
forms and late census forms.

Dummy census forms were created during census fieldwork for dwellings at which
households were known to be residing but had not returned their census forms and could not be
contacted. Census collectors were instructed to exercise extreme care in creating these dummy forms
and they needed to be satisfied that there was concrete evidence that the dwellings were occupied
on census night. The collectors were also instructed to obtain as much information as possible
regarding the number and the demographic characteristics of these residents.

When a PES address was matched to a dummy census form, the lack of name and reliable 
personal characteristics on the census form made it impossible to perform the matching oper- 
ation satisfactorily. 

It is also necessary to handle late census forms differently from normal census forms. Because
late census forms might have been prompted by a PES interviewer calling, their inclusion could 
lead to a bias in the estimation of under-enumeration. 

In the 1986 Census, there were 115,000 persons recorded on dummy census forms or late 
census returns, or 0.7 per cent of the population. Both dummy and late census forms were 
excluded from the raw census count (Y) and the PES estimate of the number of persons who 
were counted in the census (y), but were included in the PES estimate of the number of persons 
who should have been counted in the census (x). In other words, persons on dummy and late 
forms were treated as missed and adjusted for by (x). The adjustment factor (x/y) is exag- 
gerated because of the exclusion of dummy and late forms from (y), but this exaggeration 
is compensated for by the exclusion of these forms from the raw census count (Y). 
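
The compensation described above can be checked numerically. The sketch below assumes a PES sample that is perfectly representative and matching that is free of error; the population, the number of persons on normal forms and the sampling fraction are invented, and only the 0.7 per cent share on dummy or late forms is taken from the text.

```python
def adjusted_count(raw_census, pes_should, pes_were):
    """Estimation equation X = Y * (x / y)."""
    return raw_census * pes_should / pes_were


true_population = 1_000_000   # hypothetical
on_normal_forms = 974_000     # hypothetical count on ordinary census forms
on_dummy_or_late = 7_000      # 0.7 per cent of the population
sampling_fraction = 0.01      # hypothetical PES sampling fraction

# x: persons in the PES sample who should have been counted in the census.
x = sampling_fraction * true_population

# Treatment used in 1986: dummy and late forms are treated as missed,
# so they are excluded from both Y and y.
Y_excl = on_normal_forms
y_excl = sampling_fraction * on_normal_forms
X_excl = adjusted_count(Y_excl, x, y_excl)

# Alternative: dummy and late forms are treated as counted in the census.
Y_incl = on_normal_forms + on_dummy_or_late
y_incl = sampling_fraction * (on_normal_forms + on_dummy_or_late)
X_incl = adjusted_count(Y_incl, x, y_incl)

print(f"adjustment factor x/y: {x / y_excl:.4f} (excluded), {x / y_incl:.4f} (included)")
print(f"adjusted count X:      {X_excl:,.0f} (excluded), {X_incl:,.0f} (included)")
```

Excluding the dummy and late forms raises the adjustment factor, but the lower raw count offsets it exactly, so the adjusted count is the same under either treatment in this idealised setting.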


Estimation Procedure 

The estimation procedure was applied at the age by sex by geographic area (capital city 
statistical division/rest of state) level. Adjustment factors were included in the estimation for- 
mulae to partly account for non-responding and non-contact households. These factors adjust 
both of the main estimates, x and y, by effectively imputing, for each non-contact or refusing 
household, the average number of persons per household, and, for each person so imputed, 
the average rate of under-enumeration at the relevant age by sex by area level. To reduce the 
bias from the use of such adjustment factors, the factors were calculated for various subgroups 
of households by the status of enumeration at the census (such as occupied dwelling, late 
returned form). This enumeration status was considered to be related to what non-response 
was encountered in the PES.


When Are Census Counts Improved
by Adjustment? 


NOEL CRESSIE¹


ABSTRACT 


There are persuasive arguments for and against adjustment of the U.S. decennial census counts, although 
many of them are based on political rather than technical considerations. The decision whether or not 
to adjust depends crucially on the method of adjustment. Moreover, should adjustment take place using 
say a synthetic-based or a regression-based method, at which level should this occur and how should 
aggregation and disaggregation proceed? In order to answer these questions sensibly, a model of
undercount errors is needed which is "level-consistent" in the sense that it is preserved for areas at the
national, state, county, etc. level. Such a model is proposed in this article; like subareas are identified with strata
such that within a stratum the subareas’ adjustment factors have a common stratum mean and have 
variances inversely proportional to their census counts. By taking into account sampling of the areas (e.g., 
by dual-system estimation), empirical Bayes estimators that combine information from the stratum average 
and the sample value, can be constructed. These estimators are evaluated at the state level (51 states, 
including Washington, D.C.), and stratified on race/ethnicity (3 strata) using data from the 1980 post- 
enumeration survey (PEP 3-8, for the noninstitutional population). 


KEY WORDS: Empirical Bayes estimation; Loss functions; Measures of improvement; Quantile
function; Spatial correlation; Synthetic estimation. 


1. INTRODUCTION 


This article is of a technical nature, but it is important to present a brief explanation of the
political and social ramifications of the "undercount issue" in the United States of America.
By December 31 of the year of the decennial census, the U.S. Census Bureau is specified by 
law to submit state population counts to Congress for the purpose of reapportionment of the 
House of Representatives, and by March 31, 1991, to submit small-area population counts 
for the purpose of redistricting. In recent decades, the number of uses to which census data
are put has multiplied: revenue-sharing formulas use population and per capita income for
each incorporated place, demographic and sociological research at regional, state, and national
levels usually relies on census counts, etc.

Inaccurate census counts should be cause for concern to the whole nation. That certain 
groups of people (young black males, illegal aliens, etc.) are harder to count than others, is 
without question; see Ericksen and Kadane (1985), and Freedman and Navidi (1986), and the 
discussion following these articles. If the hard-to-count groups were distributed in equal pro- 
portions throughout the political and administrative regions of the USA there would be far 
less controversy over what to do about the uncounted people. As it is, many of the large 
American cities such as Chicago, Detroit, New York, and Los Angeles feel they are losing 
federal funds because their cities contain more of the types of people that tend to remain 
uncounted. And certain states such as New York and California feel they are under-represented 
in Congress, to the benefit of Midwestern states such as Indiana and Iowa.

Census undercount is defined simply as the difference between the true count and the census
count, expressed as a percentage of the true count. My approach to its estimation is model- 
based, relying on data obtained from the post-enumeration survey (PES). A number of technical 
aspects of a model-based approach to adjustment will be addressed in this article. Section 2 
establishes the model, addresses the question of choice of measures of improvement, and 
presents results for aggregation and disaggregation based on Bayes and Synthetic estimators. 
Section 3 gives empirical Bayes versions of the results of Section 2. Section 4 summarizes what 
has been learned from this model-based approach; there is also discussion of the implications 
of the sufficient conditions that guarantee risks of adjusted counts to be smaller than risks of 
census counts. 


2. THE MIXTURE MODEL AND ITS CONSEQUENCES 


At the outset I would like to explain the source of random variation in my model, originally 
defined in Cressie (1986), and further developed in Cressie (1988). I consider the true popula- 
tion in any well-defined stratum of the USA, to be unknown. After observing the correspon- 
ding census population, the uncertainties about the true population are updated. In other words, 
all inference will be performed conditionally on the observed census counts. 


2.1 The Model 


The method of synthetic estimation constructs estimators of undercount at a particular level 
(e.g., the state level) by summing undercounts of various strata (e.g., demographic strata) over 
the area being considered (e.g., California), where it is assumed that any stratum has a constant 
proportion of true counts to census count regardless of which area is being considered. For 
example, it would be assumed that the proportion for young black males is the same for 
California, Delaware and so on. Most often these strata are defined demographically according 
to the factors of age, race, and sex. However Tukey (1981) suggested that geographic and urban 
factors should be added. Two such stratifications of the USA are given in Isaki et al. (1986).

The mixture model I am proposing assumes a stratification has been defined already, 
although in Section 4 there is a suggestion how one might determine post hoc whether a chosen 
stratification is satisfactory. 

Suppose there are $j = 1, \ldots, J$ strata, and $i = 1, \ldots, I$ areas (e.g., at the enumeration-district
level, $I \approx 300{,}000$, while at the state level, $I = 51$, including the District of Columbia;
for demographic stratification, $J = 30$ say, while for the two stratifications in Isaki et al., 1986,
$J = 90$ and $J = 96$). Think of stratum $j$ as fixed (for example, stratum $j$ might be the blacks
in central cities in those SMSAs whose population is greater than or equal to 250,000, in the
New England Census Division). Then as $i$ ranges from $1, \ldots, I$, a sequence of subareas is
generated; the subarea indexed by "$ji$" refers to that part of the $i$-th area that has stratum $j$
in it. Only subareas with nonzero census counts are considered.


Define

$$Y_{ji} \equiv \text{true count in the } j\text{-th stratum of area } i, \tag{2.1}$$

$$C_{ji} \equiv \text{census count in the } j\text{-th stratum of area } i, \tag{2.2}$$

$$F_{ji} \equiv Y_{ji}/C_{ji}; \quad i = 1, \ldots, I; \; j = 1, \ldots, J. \tag{2.3}$$

Suppose for the moment that we know the ratios $\{F_{ji}: j = 1, \ldots, J\}$ for the $i$-th area. Then
from the census counts $C_{ji}$, the true count $Y_i$ can be calculated:

$$Y_i = \sum_{j=1}^{J} F_{ji} C_{ji}. \tag{2.4}$$

The $F_{ji}$ are often called adjustment factors. The strata are constructed so that these adjustment
factors $\{F_{ji}: i = 1, \ldots, I\}$ are as homogeneous as possible within the $j$-th stratum;
$j = 1, \ldots, J$ (Tukey 1981).

Realistically the adjustment factors are never known; synthetic estimators exploit the
homogeneity and replace (2.4) with

$$Y_i^{\mathrm{sya}} = \sum_{j=1}^{J} F_j C_{ji}. \tag{2.5}$$

Now there are only $J$ synthetic adjustment factors $\{F_j: j = 1, \ldots, J\}$ to estimate, which
through (2.5) yields an estimate of $Y_i$. Synthetic estimators have the advantage that the adjustment
factors are independent of $i$ and so can be applied to any level of aggregation.
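
To make the synthetic estimator concrete, the following Python sketch applies one common set of stratum adjustment factors to the census counts of several areas, as in (2.5). The factor values, the counts, and the name `synthetic_count` are hypothetical and chosen only for illustration; they are not data from the post-enumeration survey.

```python
def synthetic_count(stratum_factors, census_counts):
    """Synthetic estimate of an area's true count: sum over strata of F_j * C_ji."""
    if len(stratum_factors) != len(census_counts):
        raise ValueError("need exactly one adjustment factor per stratum")
    return sum(f * c for f, c in zip(stratum_factors, census_counts))


# Hypothetical factors for J = 3 strata (the same for every area).
factors = [1.06, 1.04, 1.00]

# Hypothetical census counts by stratum for two areas.
census = {
    "area A": [1000, 500, 8500],
    "area B": [200, 100, 1700],
}

synthetic = {name: synthetic_count(factors, counts) for name, counts in census.items()}
```

Because the factors do not depend on the area, the same list `factors` is reused for every entry of `census`; only the census counts change from area to area.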

The (estimated) adjustment factors could also be modeled by regression on independent
variables that may or may not be census variables; for example, percent minority, crime rate,
and percent conventionally counted in the census. Consider,

$$Y_i^{\mathrm{reg}} = \sum_{j=1}^{J} \Big( \sum_{k=1}^{p} \beta_{k,j} z_{k,ji} \Big) C_{ji}. \tag{2.6}$$

To fit the parameters $\beta_{1,j}, \ldots, \beta_{p,j}$ efficiently, various assumptions are made about the error
components $\{F_{ji} - \sum_{k=1}^{p} \beta_{k,j} z_{k,ji}\}$, viz. independent and identically distributed with mean zero.

Ericksen and Kadane (1985) propose the fitting of a regression relation to
$\sum_{j=1}^{J} F_{ji} C_{ji} / \sum_{j=1}^{J} C_{ji}$; $i = 1, \ldots, I$. Freedman and Navidi (1986) criticize the approach and point out the
consequences of failure of any of the error assumptions. A problem they did not perceive which 
I emphasize in (2.7) below, is the heteroskedasticity forced onto the problem by working with 
ratios; Section 2.2 justifies this model choice. Furthermore, in this latter regression approach 
undercounts across strata are combined, so that variation between strata is shared by both the 
regression relation and the error variance. More precise estimators can be obtained through 
(2.6) by allowing each stratum its own regression relation. Homoskedastic errors and a regres- 
sion model based on the combination of heterogeneous strata, are also assumed by Ericksen 
and Kadane (1987) and Ericksen, Kadane and Tukey (1987). It seems that the combination 
of heterogeneous strata was made necessary by the lack of suitable data. 

I do not assume $F_{ji}$'s that depend only on $j$, nor a regression relation for the $F_{ji}$'s; I
instead reformulate the synthetic assumption $F_{ji} = F_j$, into a (statistical) homogeneity
assumption:

$$F_{ji} \sim N(F_j, \tau_j^2/C_{ji}); \quad i = 1, \ldots, I; \; j = 1, \ldots, J, \tag{2.7}$$

where "$\sim$" means "is distributed as," and $N(\mu, \sigma^2)$ is a normal distribution with mean $\mu$ and
variance $\sigma^2$. Using a regression relation for the mean has the potential of explaining more of
the variation of the $F_{ji}$'s at the risk of introducing bias through misspecification. The strata
chosen in Section 3 are based on race; it was decided not to cloud this sensitive issue with
selection of controversial regression variables. I shall refer to the model (2.7) as a mixing
distribution. The normality assumption is made for convenience and will be relaxed later.
Here $F_j$ is a fixed but unknown mean to be estimated, and $\tau_j^2 = \mathrm{var}(\sqrt{C_{ji}}\, F_{ji})$ is a parameter
I shall call the (standardized) stratum variance. As a representation of reality, model (2.7)
is better at higher levels of aggregation; see Section 3. All distributions in (2.7) are assumed
independent.

There are good reasons for weighting the variance by $1/C_{ji}$ (see Cressie 1987a, Appendix
and 1988). The most attractive consequence of model (2.7), is that it is level-consistent; that
is, it is preserved through different levels of aggregation. Specifically,

$$F_{j,i\&i'} \sim N\big(F_j, \tau_j^2/C_{j,i\&i'}\big), \tag{2.8}$$

where

$$F_{j,i\&i'} \equiv \frac{F_{ji} C_{ji} + F_{ji'} C_{ji'}}{C_{j,i\&i'}} \quad \text{and} \quad C_{j,i\&i'} \equiv C_{ji} + C_{ji'}. \tag{2.9}$$


This is a very important property that most of the currently proposed statistical models of under- 
count do not possess. It enables the modeler to escape from the geographical and historical 
accidents that divided up the country into the states, counties, etc., that we now see.
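
The level-consistency property can be checked with a few lines of arithmetic. The sketch below assumes, as the model does, that the adjustment factors of two disjoint subareas in the same stratum are independent, with variances equal to the stratum variance divided by each subarea's census count. The counts are hypothetical, and the function name `aggregated_factor_variance` is introduced only for this illustration.

```python
def aggregated_factor_variance(tau2, c1, c2):
    """Variance of the count-weighted average (F1*c1 + F2*c2) / (c1 + c2)
    when F1 and F2 are independent with variances tau2/c1 and tau2/c2."""
    var1 = tau2 / c1
    var2 = tau2 / c2
    total = c1 + c2
    return (c1 ** 2 * var1 + c2 ** 2 * var2) / total ** 2


tau2 = 673.982                 # illustrative value of the standardized stratum variance
c1, c2 = 40000.0, 10000.0      # hypothetical census counts of two disjoint subareas

direct = aggregated_factor_variance(tau2, c1, c2)   # computed from the two subareas
model = tau2 / (c1 + c2)                            # what the model says for the combined area
```

The two numbers agree, which is the sense in which weighting the variance by the reciprocal of the census count preserves the form of the model under aggregation.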

Of course the $\{F_{ji}: i = 1, \ldots, I; j = 1, \ldots, J\}$ are not available as data; if they were,
$\{Y_i: i = 1, \ldots, I\}$ would be trivial to calculate. In reality, some sampling takes place so that
$F_{ji}$ is observed imperfectly. The best way to think of it is that within stratum $j$ of the $i$-th area,
a sample is taken for undercount. Let the outcome be $X_{ji}$ (e.g., $X_{ji}$ is the ratio of dual-system
estimator to census count, for the $j$-th stratum in the $i$-th area), and model

$$X_{ji} \sim N(F_{ji}, \sigma_j^2/C_{ji}); \quad i = 1, \ldots, I; \; j = 1, \ldots, J, \tag{2.10}$$

where $F_{ji}$ is an unknown mean parameter to be estimated, and $\sigma_j^2 = \mathrm{var}(\sqrt{C_{ji}}\, X_{ji})$ is a parameter
I shall call the (standardized) sampling variance. All distributions in (2.10) are assumed
independent. When the number of strata is large, a large PES (say, 300,000 households) is 
needed to obtain data for each area-stratum combination. 

Probability-proportional-to-size sampling was used by the U.S. Census Bureau in its 1980 
post-enumeration program, which implies a sampling variance of the form given in (2.10). As 
a consequence of this weighting, (2.10) is also level-consistent. 


2.2 Loss Functions (Measures of Improvement) and their Bayes Estimators 


The term loss function is used in statistical decision theory (see, for example, Ferguson 1967)
to quantify the loss incurred from using $\delta$ as a parameter estimator when the true value is $\theta$.
For example, a squared-error loss function is $(\delta - \theta)^2$. Adopting a more optimistic terminology,
the Census Bureau decided in 1986 to use "measure of improvement" instead of "loss
function."

Think of (2.10) as a conditional distribution of $X_{ji}$ given $F_{ji}$, and (2.7) as the mixing (or
"prior") distribution of $F_{ji}$. To predict $F_{ji}$ then, the "posterior" distribution of $F_{ji}$ given $X_{ji}$
is needed. Notice that a Bayesian terminology is being used since I am thinking of the $F_{ji}$ as
random variables whose collection is modeled according to (2.7). But as well as these random
parameters, there are fixed but unknown parameters $\{F_j\}$, $\{\tau_j^2\}$, $\{\sigma_j^2\}$ to be estimated. The
posterior of $F_{ji} \mid X_{ji}$ is,

$$\frac{(\text{distribution of } X_{ji} \mid F_{ji}) \cdot (\text{"prior" of } F_{ji})}{\text{marginal of } X_{ji}}. \tag{2.11}$$


For squared-error loss, the usual Bayes estimator of $F_{ji}$ is simply the expectation of $F_{ji}$ with
respect to the posterior: $F_{ji}^{\mathrm{uba}} = E(F_{ji} \mid X_{ji})$. Substituting the model (2.7), (2.10) into (2.11),
the posterior distribution is easily obtained (see, for example, Lindley and Smith 1972):

$$F_{ji} \mid X_{ji} \sim N\left( \frac{\tau_j^2 X_{ji} + \sigma_j^2 F_j}{\tau_j^2 + \sigma_j^2},\; \frac{\tau_j^2 \sigma_j^2}{(\tau_j^2 + \sigma_j^2)\, C_{ji}} \right); \tag{2.12}$$

for $i = 1, \ldots, I$; $j = 1, \ldots, J$. Hence the posterior expectation is simply

$$F_{ji}^{\mathrm{uba}} = F_j + D_j (X_{ji} - F_j), \tag{2.13}$$

where $D_j \equiv \tau_j^2/(\tau_j^2 + \sigma_j^2)$. To convert (2.13) into an empirical Bayes estimator, estimators
have to be found for $F_j$ and $D_j$; see Section 3.1.
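
The posterior expectation is a weighted compromise between the observed ratio and the stratum mean, with the weight on the observation equal to the stratum variance divided by the sum of the stratum and sampling variances. A minimal Python sketch, assuming the parameters are known, is given below; the numerical inputs and the function names are hypothetical.

```python
def shrinkage_weight(tau2, sigma2):
    """Weight D = tau2 / (tau2 + sigma2) placed on the observed ratio."""
    if tau2 < 0 or sigma2 < 0 or tau2 + sigma2 == 0:
        raise ValueError("variances must be nonnegative and not both zero")
    return tau2 / (tau2 + sigma2)


def usual_bayes_factor(x, stratum_mean, tau2, sigma2):
    """Posterior mean of the adjustment factor: stratum_mean + D * (x - stratum_mean)."""
    d = shrinkage_weight(tau2, sigma2)
    return stratum_mean + d * (x - stratum_mean)


# Hypothetical observed ratio 1.10 in a stratum whose mean factor is 1.06.
halfway = usual_bayes_factor(1.10, 1.06, tau2=500.0, sigma2=500.0)    # equal variances
no_spread = usual_bayes_factor(1.10, 1.06, tau2=0.0, sigma2=500.0)    # no stratum variation
no_noise = usual_bayes_factor(1.10, 1.06, tau2=500.0, sigma2=0.0)     # no sampling error
```

With equal variances the estimate falls halfway between the observation and the stratum mean; with no stratum variation it collapses to the stratum mean, and with no sampling error it returns the observation itself.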

Although the normality assumptions in (2.7) and (2.10) were used to derive (2.13), more 
generally (2.13) can be shown to be Bayes for squared-error loss, when assuming simply the 
mean and variance structure of (2.7) and (2.10), and $E(F_{ji} \mid X_{ji}) = a_j + b_j X_{ji}$. Goldstein
(1975) has an even more general result of which this is a special case. For ease of exposition 
I shall continue to assume normality but it should be remembered that there is a nonparametric 
optimality for all the estimators considered. 

The estimator $F_{ji}^{\mathrm{uba}}$ given by (2.13) is Bayes for squared-error loss, within the $j$-th stratum
of the $i$-th area. Define the estimator of $Y_i$,

$$Y_i^{\mathrm{uba}} \equiv \sum_{j=1}^{J} F_{ji}^{\mathrm{uba}} C_{ji}; \quad i = 1, \ldots, I, \tag{2.14}$$

and consider the following general loss function:

$$\sum_{i=1}^{I} (Y_i^{\mathrm{est}} - Y_i)^2 f(C_i), \tag{2.15}$$

where $f(C_i)$ is any nonnegative function of the $i$-th area's census count. Minimizing
(2.15) over all $Y_i^{\mathrm{est}} = \sum_{j=1}^{J} F_{ji}^{\mathrm{est}} C_{ji}$ leads to choosing $F_{ji}^{\mathrm{est}}$'s such that
$E[\sum_{i=1}^{I} \sum_{j=1}^{J} \lambda_{ji} (F_{ji}^{\mathrm{est}} - F_{ji})^2 \mid \{X_{ji}: i = 1, \ldots, I; j = 1, \ldots, J\}]$ is minimized, where the $\lambda_{ji} \ge 0$ only
depend on census counts $\{C_{ji}: i = 1, \ldots, I; j = 1, \ldots, J\}$. This minimum is achieved by
the estimator (2.14), which shows it to possess a certain robustness since it is optimal regardless
of which $f(\cdot)$ is chosen.

In accordance with recommendation 7.2 in National Academy of Sciences (1985), choice
of $f(C_i) = 1/C_i$ yields an area's contribution to the total loss that reflects the size of its
population. Among the loss functions the Census Bureau has been using, the one most like
this is

$$\sum_{i=1}^{I} (Y_i^{\mathrm{est}} - Y_i)^2 / Y_i; \tag{2.16}$$

it is "most like" in the sense that it is also a weighted sum of squares where each summand
yields an area's contribution to the total loss that reflects the size of its population. Here, undercount
in more populous areas receives more weight, so that using such loss functions reflects
an emphasis on national considerations. The loss function $\sum_{i=1}^{I} (Y_i^{\mathrm{est}} - Y_i)^2 / Y_i^2$, which
guarantees undercount equity for the $I$ areas, will not be considered in this article.

It is easy to show that the Bayes estimator in the case of loss function (2.16) is given by,

$$Y_i^{(1)} = \left( E\left[ \Big( \sum_{j=1}^{J} F_{ji} C_{ji} \Big)^{-1} \,\Big|\, \{X_{ji}: j = 1, \ldots, J\} \right] \right)^{-1}, \tag{2.17}$$

which is not a linear combination of $\{F_{ji}^{\mathrm{uba}}: j = 1, \ldots, J\}$. However to a first approximation,
using the $\delta$-method, it can be shown that this $Y_i^{(1)} \simeq Y_i^{\mathrm{uba}}$. This is in fact true for a much
larger class of loss functions suggested by Cressie (1987b):


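$$L^{\lambda} \equiv \sum_{i=1}^{I} \frac{2}{\lambda(\lambda + 1)} \left\{ Y_i^{\mathrm{est}} \left[ \Big( \frac{Y_i^{\mathrm{est}}}{Y_i} \Big)^{\lambda} - 1 \right] + \lambda (Y_i - Y_i^{\mathrm{est}}) \right\}; \quad -\infty < \lambda < \infty; \tag{2.18}$$

the cases $\lambda = 0, -1$ are defined as the respective limits of $L^{\lambda}$ as $\lambda \to 0, -1$. Read and
Cressie (1988, Chapter 8) show that in this case the Bayes estimator is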


$$Y_i^{(\lambda)} = \left( E\left[ \Big( \sum_{j=1}^{J} F_{ji} C_{ji} \Big)^{-\lambda} \,\Big|\, \{X_{ji}: j = 1, \ldots, J\} \right] \right)^{-1/\lambda}, \tag{2.19}$$

which reduces to (2.14) when $\lambda = -1$, and to (2.17) when $\lambda = 1$.
The curious fact is that most undercount estimators used are optimal (under various model
assumptions) for $\lambda = -1$, but their performance is measured using $\lambda = 1$; i.e., (2.16). The
$\delta$-method argument gives $Y_i^{(\lambda)} \simeq Y_i^{\mathrm{uba}}$, and recall $Y_i^{\mathrm{uba}}$ is optimal for (2.15); therefore
squared-error loss estimators of undercount perform well according to a large class of loss functions.
This was observed by Kadane (1984) in his hierarchical Bayesian analysis of 1980 census
undercount data ($\lambda = -1$ and $\lambda = -2$ were compared), and confirmed on the studies of
artificial populations carried out by Cressie and Dajani (1988).
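
A small numerical illustration may help show why estimators that are optimal for one member of this family perform well for the others. The Python sketch below evaluates the power-type estimator, the posterior expectation of the true count raised to the power minus lambda, all raised to the power minus one over lambda, for a hypothetical three-point posterior distribution of an area's true count. The posterior, its spread, and the function name `power_bayes_estimate` are assumptions made for illustration; the lambda = 0 branch uses the limiting value (the exponential of the expected logarithm).

```python
import math


def power_bayes_estimate(values, probs, lam):
    """(E[Y**(-lam)])**(-1/lam) for a discrete posterior on Y.
    lam = -1 gives the posterior mean, lam = 1 the harmonic-type estimator."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to one")
    if lam == 0:
        # Limit as lam -> 0: exponential of the posterior expected log count.
        return math.exp(sum(p * math.log(y) for y, p in zip(values, probs)))
    moment = sum(p * y ** (-lam) for y, p in zip(values, probs))
    return moment ** (-1.0 / lam)


# Hypothetical posterior for a true count near one million (about 2% spread).
values = [980000.0, 1000000.0, 1020000.0]
probs = [0.25, 0.5, 0.25]

mean_est = power_bayes_estimate(values, probs, -1.0)      # squared-error optimum
harmonic_est = power_bayes_estimate(values, probs, 1.0)   # optimum for weighting by 1/Y
geometric_est = power_bayes_estimate(values, probs, 0.0)
neyman_est = power_bayes_estimate(values, probs, -2.0)
```

For a posterior this concentrated, the four estimates differ from one another by only a small fraction of one percent, which is the practical content of the first-order approximation.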

It has just been demonstrated that the estimators (2.13) and (2.14) are Bayes (or approximately
so) for a large class of loss functions. However it is not likely that the ensemble properties
of $\{F_{ji}^{\mathrm{uba}}: i = 1, \ldots, I; j = 1, \ldots, J\}$ estimate the corresponding ensemble properties
of $\{F_{ji}: i = 1, \ldots, I; j = 1, \ldots, J\}$ very well. This follows from the inequality
$\mathrm{var}(\theta) \ge \mathrm{var}(E(\theta \mid X))$; in other words the posterior mean of the parameter has a smaller variance than
the parameter itself. For estimation of state population totals, this does not matter, but for
estimation of the distribution of say $\{F_{ji}, C_{ji}: i = 1, \ldots, 51\}$; $j = 1, \ldots, J$, or
$\{Y_i: i = 1, \ldots, 51\}$, (2.13) is ill-suited to the task. Such a distribution is needed in standards research
(Mulry-Liggan and Hogan 1986) to determine the proportion of people in a stratum affected
by an undercount more severe than u% (Cressie 1988, Section 4).

I shall constrain the estimator of $\{F_{ji}: i = 1, \ldots, I\}$ so that the posterior moments of its
(weighted) empirical distribution function match the moments of the estimator's weighted
empirical distribution function. This is achieved by modifying the usual Bayes estimator,
yielding a constrained Bayes estimator with the right ensemble properties. Louis (1984) presents
the details for an equal-variance version of the model (2.7), (2.10), but a straightforward
modification of his approach is possible for weighted variances. Cressie (1986) shows that such
a constrained Bayes estimator is

$$F_{ji}^{\mathrm{cba}} = \xi_j + G_j (X_{ji} - \xi_j), \tag{2.20}$$

$$Y_i^{\mathrm{cba}} = \sum_{j=1}^{J} F_{ji}^{\mathrm{cba}} C_{ji}, \tag{2.21}$$

obtained by solving for $\xi_j$ and $G_j$ in:

$$\xi_j + G_j (X_{j\cdot} - \xi_j) = F_j + D_j (X_{j\cdot} - F_j);$$

$$G_j^2 \sum_{i} \Big( C_{ji} \Big/ \sum_{h} C_{jh} \Big) (X_{ji} - X_{j\cdot})^2 = (I - 1) D_j \sigma_j^2 \Big/ \sum_{h} C_{jh} + D_j^2 \sum_{i} \Big( C_{ji} \Big/ \sum_{h} C_{jh} \Big) (X_{ji} - X_{j\cdot})^2, \tag{2.22}$$

where

$$X_{j\cdot} \equiv \sum_{i=1}^{I} X_{ji} C_{ji} \Big/ \sum_{i=1}^{I} C_{ji}. \tag{2.23}$$

2.3 Risks of Adjustment; Model Parameters Assumed Known 


The model-based approach described in the previous section specifies undercounts in various 
area-strata combinations, to be random variables. When it comes to comparing the value of 
one adjustment procedure against another, the expected loss (or the risk) is used. Statistical 
procedures with small risk are preferred. 

In the absence of other considerations (e.g., political, practical, etc.), implementing the procedure
with the smallest risk is the correct, impartial approach. The statistician knows that
adherence to this modus operandi will yield better estimates on the average, where the average 
is taken over all problems considered by the statistician. However there is nothing to guarantee 
that for the particular problem being considered, here estimation of undercount in the 1990 
census, a set of area-strata estimates derived from the criterion of minimum risk will actually 
have smaller loss than another set of estimates. To put it more succinctly, the inequality 
$E(V^2) < E(W^2)$ does not guarantee that $V^2 < W^2$ for a particular realization. If, in the
light of the data collected, a minimum risk prediction did not prove to be the most accurate, 
the statistical procedure should still be seen as optimal. 

In the rest of this section, various results about Bayes estimators will be stated (proofs are 
given in Cressie 1988). Needless to say, these results rely on the correctness of the assumed 
model. In practice, the more relevant results are for empirical Bayes estimators, which are given 
(with proofs) in Section 3. 

The first thing to recall (from Section 2.2) about the usual Bayes estimators (2.13), (2.14) 
is that they are optimal or near optimal for a large class of loss functions. Moreover the 
estimators are level-free; i.e., they are not only optimal at the level at which they are constructed, 
but after aggregation they are also optimal at the higher level. From (2.14), 


$$Y_i^{\mathrm{uba}} + Y_{i'}^{\mathrm{uba}} = Y_{i\&i'}^{\mathrm{uba}}, \tag{2.24}$$

where $i\&i'$ denotes the area obtained by combining the two disjoint areas $i$ and $i'$.


Therefore, one should aim to construct a Bayes estimator at the very lowest level (census 
blocks) and aggregate up to whatever level is desired, thus ensuring consistency of counts at 
all levels. In practice this is out of the question, simply because the post-enumeration survey 
would never be large enough to give dual-system estimated undercount data for all the blocks. 
The same is true at the enumeration-district level and the county level. Moreover, at these lower 
levels the model (2.7) and (2.10) does not fit as well (Cressie and Dajani 1988); an adequate 
fit at the state level is shown in Section 3.1. 

It is certain that the post-enumeration survey will gather data from each of the 51 states, 
allowing construction of (empirical) Bayes estimators at the state level. Politically, the state 
level is the most sensitive; reapportionment of the 50 states’ representation (Washington, D.C. 
is excluded) in the House is the first use made of decennial census counts (mandated to reach 
Congress by December 31 in the year of the census). Thus at this level, the Bayes estimators 
(2.13) and (2.14) offer a compromise between a state's observed adjustment factors
$\{X_{ji}: j = 1, \ldots, J\}$, and the (synthetic) adjustment factors $\{F_j: j = 1, \ldots, J\}$. For example:
MERE aii s black undercount is recognized as being potentially different from New York’s 
black undercount, when using the Bayes estimators. 

I shall now explore the consequences of synthetic estimation at lower levels, after Bayes 
estimation is carried out at a given level. For consistency of counts at all levels, it is desired 
to estimate undercount at the block level and aggregate up to whatever level is desired. Suppose 
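an adjustment factor $F_{ji}^{\mathrm{est}}$ is estimated for the $j$-th stratum in the $i$-th area. Now suppose
$i = i_1 \& i_2$; i.e., the $i$-th area is split up into two disjoint subareas $i_1$ and $i_2$. Then the synthetic
method at the lower level posits,

$$F_{ji_1}^{\mathrm{est}} = F_{ji_2}^{\mathrm{est}} = F_{ji}^{\mathrm{est}}, \tag{2.25}$$

so that estimators of the true population are given by,

$$Y_{i_1}^{\mathrm{est}} = \sum_{j=1}^{J} F_{ji}^{\mathrm{est}} C_{ji_1}, \qquad Y_{i_2}^{\mathrm{est}} = \sum_{j=1}^{J} F_{ji}^{\mathrm{est}} C_{ji_2}. \tag{2.26}$$

Notice that from (2.25) and (2.26),

$$Y_{i_1}^{\mathrm{est}} + Y_{i_2}^{\mathrm{est}} = \sum_{j=1}^{J} F_{ji}^{\mathrm{est}} C_{ji} = Y_i^{\mathrm{est}}, \tag{2.27}$$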

which is the desired disaggregation-aggregation property. 
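
The disaggregation-aggregation property can be seen directly in a short computation. In the Python sketch below, factors estimated for an area are carried down unchanged to two disjoint subareas, and the subarea estimates are then added back together. All factor values and census counts are hypothetical, as is the helper name `adjusted_count`.

```python
def adjusted_count(factors, counts):
    """Sum over strata of (adjustment factor) * (census count)."""
    return sum(f * c for f, c in zip(factors, counts))


# Hypothetical estimated factors for three strata in one area.
area_factors = [1.07, 1.03, 1.00]

# Hypothetical census counts by stratum in two disjoint subareas of that area.
sub1 = [300, 50, 4000]
sub2 = [700, 150, 6000]
whole = [a + b for a, b in zip(sub1, sub2)]

est_sub1 = adjusted_count(area_factors, sub1)    # synthetic estimate for subarea 1
est_sub2 = adjusted_count(area_factors, sub2)    # synthetic estimate for subarea 2
est_whole = adjusted_count(area_factors, whole)  # estimate made at the area level
```

The subarea estimates sum to the area-level estimate, so adjusted counts published at different levels remain mutually consistent.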


Compare the risk of using $Y_i^{\mathrm{uba}}$, $Y_i^{\mathrm{sya}}$, and $Y_i^{\mathrm{cba}}$ (given by (2.14), (2.5), and (2.21) respectively)
to the risk of using $C_i$, the census count of the $i$-th area. Using the loss function (2.15),
the risks are:

$$\text{uba-risk}_i \equiv E[(Y_i^{\mathrm{uba}} - Y_i)^2 f(C_i)], \tag{2.28}$$

$$\text{cen-risk}_i \equiv E[(C_i - Y_i)^2 f(C_i)], \tag{2.29}$$

$$\text{sya-risk}_i \equiv E[(Y_i^{\mathrm{sya}} - Y_i)^2 f(C_i)], \tag{2.30}$$

$$\text{cba-risk}_i \equiv E[(Y_i^{\mathrm{cba}} - Y_i)^2 f(C_i)]. \tag{2.31}$$

The following sequence of inequalities can be proved (Cressie 1988):

$$\text{uba-risk}_i < \text{cba-risk}_i < \text{sya-risk}_i < \text{cen-risk}_i, \tag{2.32}$$

where the middle inequality requires $\sigma_j^2/\tau_j^2 < 3$; $j = 1, \ldots, J$.


Now compare the risk of using $Y_{i_1}^{\mathrm{est}}$ and $Y_{i_2}^{\mathrm{est}}$ (estimators of $Y_{i_1}$ and $Y_{i_2}$ respectively) based
on $F_{ji}^{\mathrm{uba}}$ in (2.25), with the risk of using $C_{i_1}$ and $C_{i_2}$, where area $i = i_1 \& i_2$, the union of
disjoint areas $i_1$ and $i_2$. It can be shown (Cressie 1988) that the synthetic estimation based on
the usual Bayes estimator defined at a particular level but applied at a lower level, always has
smaller risk than the census counts.

It is also of interest to determine the behaviour of the census-based risk minus the
Bayes-then-synthetic-based risk as a function of the level; the larger this difference, the more
advantageous it is to adjust the census counts. Here use $f(C_i) = 1/C_i$ in loss function (2.15). It is
possible to show (Cressie 1988) that as disaggregation proceeds to a lower level, the "risk gap"
between Bayes-then-synthetic estimation and census counts widens in absolute terms. Although
this is proved there for the uba-then-synthetic-based estimator, the same is true for
cba-then-synthetic-based and sya-then-synthetic-based estimators, and the ordering of risks (2.32) is
preserved at any level of disaggregation. This conclusion depends on the model (2.7) and (2.10)
holding at all levels. Unfortunately at the lower levels there is some evidence that biases can
be substantial. That is, $E(F_{ji}) = F_j + b_{ji}$; $E(X_{ji} \mid F_{ji}) = F_{ji} + d_{ji}$. Realistically $b_{ji}$'s and
$d_{ji}$'s are never zero, but at sufficiently high levels of aggregation they are unimportant. At the
block and enumeration-district level they can be substantial (Cressie and Dajani 1988) and could
invalidate the risk inequalities proved so far. Moreover, at lower levels, the data $\{X_{ji}\}$ are
more variable leading to less precise estimates of $D_j = \tau_j^2/(\tau_j^2 + \sigma_j^2)$ in the empirical
Bayes version (see Section 3) of the Bayes estimator (2.14). These observations, as well as a
recognition of the difference between risk and loss, help to explain the deterioration of the 
performance of the adjusted counts at lower levels, observed in artificial populations (Schultz 
et al. 1986). 


3. EMPIRICAL BAYES ADJUSTMENT OF CENSUS COUNTS 


Obtain from (2.14), (2.21), and (2.5), the estimated (or adjusted) true area counts $Y_i^{\mathrm{uba}}$,
$Y_i^{\mathrm{cba}}$ and $Y_i^{\mathrm{sya}}$, respectively. In order to make these functions only of the data, estimators are
needed for the unknown parameters $F_j$, $\tau_j^2$, and $\sigma_j^2$. Fay and Herriot (1979) give empirical
Bayes estimators in a regression setting, of which the model (2.7), (2.10) is a special case. For
reasons of statistical consistency (see Cressie 1986, Section 3.3), choose,

$$\hat{F}_j = X_{j\cdot}, \tag{3.1}$$

$$\hat{\tau}_j^2 = \max\left\{ \sum_{i=1}^{I} C_{ji} I(C_{ji} > 0) (X_{ji} - X_{j\cdot})^2 \Big/ \Big( \sum_{i=1}^{I} I(C_{ji} > 0) - 1 \Big) - \hat{\sigma}_j^2,\; 0 \right\}; \tag{3.2}$$

$\hat{\sigma}_j^2$ is obtained from sampling considerations: it is known for dual-system estimation, and
Schultz et al. (1986) determine it for their artificial populations by replicating
probability-proportional-to-size sampling of 1,440 enumeration districts from the approximately 300,000
total number.

Statistical stability (i.e., small sampling variance) for sample means is easier to achieve than
for sample variances. The coefficient of variation of the sample variance is approximately
$\sqrt{2/n}$; therefore to achieve a relative confidence region (0.5, 1.5) for the population
variance, a value of $n = 32$ is needed; and to achieve a region (0.95, 1.05) a value of $n = 3{,}200$
is needed. Thus the estimator, $\sum_{i=1}^{I} C_{ji} I(C_{ji} > 0) (X_{ji} - X_{j\cdot})^2 / (\sum_{i=1}^{I} I(C_{ji} > 0) - 1)$
of $\tau_j^2 + \sigma_j^2$ is very unstable, particularly when there are a large number of strata and hence
$\sum_{i=1}^{I} I(C_{ji} > 0)$ is small (smaller than 30).
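
The sample sizes quoted for the two relative confidence regions can be reproduced with a short calculation. The Python sketch below assumes that the half-width of the region is taken to be two coefficients of variation, that is, twice the square root of 2/n; that multiplier is an assumption inferred from the quoted figures rather than something stated in the text, and the function names are introduced here for illustration.

```python
import math


def cv_of_sample_variance(n):
    """Approximate coefficient of variation of a sample variance: sqrt(2/n)."""
    return math.sqrt(2.0 / n)


def sample_size_for_half_width(half_width, multiplier=2.0):
    """Smallest n with multiplier * sqrt(2/n) <= half_width.
    The default multiplier of 2 is an assumed 'two standard deviations' rule."""
    exact = 2.0 * (multiplier / half_width) ** 2
    return math.ceil(exact - 1e-9)  # small guard against floating-point overshoot


n_wide = sample_size_for_half_width(0.5)     # region (0.5, 1.5)
n_narrow = sample_size_for_half_width(0.05)  # region (0.95, 1.05)
```

Under this assumption the function returns 32 and 3,200 for the two regions. With only 51 areas per stratum, the implied half-width is roughly 0.4, which illustrates how imprecise a variance estimate based on a few dozen areas can be.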

One way around this is to introduce a further mixing distribution into the problem, namely,
model the $\{\tau_j^2: j = 1, \ldots, J\}$ as being generated by the reciprocal of a gamma distribution
for example. Thus instead of estimating $J$ parameters $\{\tau_j^2: j = 1, \ldots, J\}$, the problem can
be reduced to estimating just two gamma parameters (see e.g., Hui and Berger 1983). Another
possibility is to aggregate temporarily some of the strata for the purpose of estimating the
stratum variance. In other words, define disjoint groups of strata indices, $A_1, \ldots, A_K$, such
that $\cup \{A_k: k = 1, \ldots, K\} = \{1, 2, \ldots, J\}$, and $\tau_j^2 = \tau_{j'}^2 = \tau_{(k)}^2$, whenever $j$ and $j'$
belong to the same $A_k$. In this way, Cressie and Dajani (1988) reduce the number of stratum
variance parameters from $J = 96$ down to $K = 4$. For the data analyzed below, since
$\sum_{i=1}^{51} I(C_{ji} > 0) = 51$ for each of the three race strata, it was not necessary to "borrow
strength" in the ways just described.


3.1 Empirical Bayes Estimators


The usual (see, for example, Morris 1983) and constrained (Louis 1984) empirical Bayes 
estimators can now be constructed: 


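$$F_{ji}^{\mathrm{ueb}} = X_{j\cdot} + \{\hat{\tau}_j^2/(\hat{\tau}_j^2 + \hat{\sigma}_j^2)\} (X_{ji} - X_{j\cdot}), \tag{3.3}$$

$$Y_i^{\mathrm{ueb}} = \sum_{j=1}^{J} F_{ji}^{\mathrm{ueb}} C_{ji}; \quad i = 1, \ldots, I, \tag{3.4}$$

$$F_{ji}^{\mathrm{ceb}} = X_{j\cdot} + \{\hat{\tau}_j^2/(\hat{\tau}_j^2 + \hat{\sigma}_j^2)\}^{1/2} (X_{ji} - X_{j\cdot}), \tag{3.5}$$

$$Y_i^{\mathrm{ceb}} = \sum_{j=1}^{J} F_{ji}^{\mathrm{ceb}} C_{ji}; \quad i = 1, \ldots, I. \tag{3.6}$$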


The usual empirical Bayes estimator (3.3) can also be obtained from standard theory for linear 
models with random effects (Henderson 1976). 
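
A compact Python sketch of these estimators for a single stratum is given below. It assumes that every area has a positive census count in the stratum, that the sampling variance estimate is supplied from outside (as it would be from dual-system estimation), and that the stratum variance estimate is the count-weighted sum of squares about the weighted mean, divided by the number of areas minus one, less the sampling variance, truncated at zero. The observed ratios, counts, and function name are hypothetical.

```python
def empirical_bayes_factors(x, counts, sigma2_hat):
    """Usual (ueb) and constrained (ceb) empirical Bayes adjustment factors
    for one stratum, given observed ratios x and census counts."""
    if any(c <= 0 for c in counts) or len(x) < 2 or sigma2_hat <= 0:
        raise ValueError("need positive counts, at least two areas, sigma2_hat > 0")
    total = sum(counts)
    x_bar = sum(xi * ci for xi, ci in zip(x, counts)) / total   # estimated stratum mean
    spread = sum(ci * (xi - x_bar) ** 2 for xi, ci in zip(x, counts)) / (len(x) - 1)
    tau2_hat = max(spread - sigma2_hat, 0.0)                    # truncated at zero
    weight = tau2_hat / (tau2_hat + sigma2_hat)
    ueb = [x_bar + weight * (xi - x_bar) for xi in x]
    ceb = [x_bar + weight ** 0.5 * (xi - x_bar) for xi in x]
    return {"x_bar": x_bar, "tau2_hat": tau2_hat, "weight": weight, "ueb": ueb, "ceb": ceb}


# Hypothetical observed ratios and census counts for four areas in one stratum.
x = [1.02, 1.08, 1.05, 1.10]
counts = [500000.0, 2000000.0, 1000000.0, 1500000.0]

fit = empirical_bayes_factors(x, counts, sigma2_hat=522.183)
flat = empirical_bayes_factors(x, counts, sigma2_hat=5000.0)  # sampling noise dominates
```

In the second call the estimated stratum variance is truncated to zero, so both sets of estimates collapse to the weighted mean of the observed ratios, which is the synthetic value.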


Notice that when $\hat{\tau}_j^2 = 0$, the empirical Bayes estimators of the $j$-th stratum adjustment factors
all reduce to the synthetic estimator $X_{j\cdot}$. The presence of the weight $\{\hat{\tau}_j^2/(\hat{\tau}_j^2 + \hat{\sigma}_j^2)\}^{1/2}$
in the constrained empirical Bayes estimator (3.5) may look a little strange at first, but it is
seen in Cressie (1987a) to yield an unbiased estimator of the stratum error $C_{ji}^{1/2}(F_{ji} - F_j)$.

An earlier suggestion for empirical Bayes modeling of undercount came from Dempster and 
Tomberlin (1980), who proposed that the number of undercounted people in a subarea might 
be a binomial random variable. They defined a hierarchical Bayes model but did not take into
account the heteroskedastic variation. Stroud (1987) introduces a covariate into a two-stage 
Bayesian model, but his assumptions of homoskedastic variation and equal sample sizes in 
each subarea, are too restrictive for the problem considered in this article.

Formulas for the bias and mean-squared error of the usual empirical Bayes (ueb) estimators
(3.3), (3.4), the constrained empirical Bayes (ceb) estimators (3.5), (3.6), and the synthetic
estimators

$$F_{ji}^{\mathrm{syn}} = X_{j\cdot}, \tag{3.7}$$

$$Y_i^{\mathrm{syn}} = \sum_{j=1}^{J} F_{ji}^{\mathrm{syn}} C_{ji}; \quad i = 1, \ldots, I, \tag{3.8}$$

are given in Cressie (1987a, Section 4). Since undercount is a nonlinear function of the true
population, its estimators based on $\{F_{ji}^{\mathrm{est}}: i = 1, \ldots, I; j = 1, \ldots, J\}$, viz.

$$u_{ji}^{\mathrm{est}} = \Big( 1 - \frac{1}{F_{ji}^{\mathrm{est}}} \Big) \times 100; \quad i = 1, \ldots, I; \; j = 1, \ldots, J, \tag{3.9}$$

$$u_i^{\mathrm{est}} = \Big( 1 - \frac{C_i}{Y_i^{\mathrm{est}}} \Big) \times 100; \quad i = 1, \ldots, I, \tag{3.10}$$


are biased; estimated biases and mean-squared errors can be obtained by the $\delta$-method (Cressie
1987a, Section 4). All of these bias and mean-squared error calculations do not take into account
variation due to the (nonlinear) estimation of $\tau_j^2/(\tau_j^2 + \sigma_j^2)$.
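
Since undercount is defined as the difference between the true count and the census count, expressed as a percentage of the true count, an adjustment factor F (true count over census count) corresponds to an undercount of 100(1 - 1/F) percent. The Python sketch below makes that conversion explicit; the function names are introduced here, and the two factor values used in the example are the stratum means reported for blacks and for others in (3.11) and (3.13).

```python
def undercount_percent_from_factor(factor):
    """Undercount as a percentage of the true count, from F = true / census."""
    if factor <= 0:
        raise ValueError("adjustment factor must be positive")
    return 100.0 * (1.0 - 1.0 / factor)


def undercount_percent_from_counts(census_count, adjusted_count):
    """Undercount as a percentage of the adjusted (estimated true) count."""
    if adjusted_count <= 0:
        raise ValueError("adjusted count must be positive")
    return 100.0 * (1.0 - census_count / adjusted_count)


u_blacks = undercount_percent_from_factor(1.06076)   # about 5.7 percent
u_others = undercount_percent_from_factor(0.99981)   # slightly negative: a net overcount
```

The conversion is nonlinear in the factor, which is why an unbiased estimate of the adjustment factor does not automatically give an unbiased estimate of the undercount.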

Suppose that the following three U.S. strata (based on race/ethnicity) are chosen: blacks, 
nonblack hispanics, and others. Data from the post-enumeration survey following the 1980 
U.S. Census are given in Cressie (1987a, Table 1). These are from the noninstitutional population
(Cowan and Bettin 1982) and have been labeled "PEP 3-8" by the U.S. Census Bureau;
the "3" refers to census omissions being obtained from an April survey and to imputing missing
data, and the "8" refers to erroneous enumerations being obtained from a separate survey
that imputed missing data with the help of U.S. Post Office information.

From these data and (3.1), (3.2), Cressie (1987a) estimated the mean of the mixture distri- 
bution, and standardized stratum and sampling variances defined in (2.7) and (2.10): 


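blacks: $\hat{F}_1 = 1.06076$, $\hat{\tau}_1^2 = 673.982$, $\hat{\sigma}_1^2 = 522.183$, (3.11)

nonblack hispanics: $\hat{F}_2 = \ldots$, $\hat{\tau}_2^2 = \ldots$, $\hat{\sigma}_2^2 = 246.585$, (3.12)

others: $\hat{F}_3 = 0.99981$, $\hat{\tau}_3^2 = 242.134$, $\hat{\sigma}_3^2 = 242.152$. (3.13)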


Based on these parameter estimators and the PEP 3-8 data bag) = 1.22, oe = 1S oe LTS 
Cressie (1987a) gave undercount estimates $\{u_{ji}^{ueb}\}$, $\{u_{ji}^{syn}\}$ for ueb-based and syn-based
estimators defined by (3.3) and (3.7) respectively.

To check the fit of the model, the residuals {C# (Fg ~ Eee): i= 1, ..& [} were com- 
puted for each of the three strata. Table 1 shows the results, presented as stem-and-leaf plots 
for the three race strata; a bell-shaped plot for each is the ideal. The model appears to fit the 
data, except for the nonblack-hispanic stratum in the state of New York. In light of the lawsuit,
Cuomo vs. Baldrige, heard by the Southern District Court of New York in 1983, this new
way of looking at the data tells an interesting story. The nonblack hispanics in New York State
were grossly undercounted, even in relation to their undercounted fellow nonblack hispanics
in other states. Incidentally, the judge decided in favour of the U.S. Department of Commerce
(in December 1987) on the grounds that the statistical and demographic professions had not
developed adequate methods of adjustment for the whole country by 1980.

When are census counts improved by replacing $\{C_i: i = 1, ..., I\}$ with $\{Y_i^{est}: i = 1, ..., I\}$?
The next section gives conditions under which an analogous ordering to (2.32) still
holds in the empirical Bayes setting.


3.2 Adjustment at Different Levels; Model Parameters Estimated 


The same comments at the beginning of Section 2.3 apply; in a model-based approach a 
small risk does not guarantee a small loss in every problem but only on the average. Also the 
analogous aggregation property to (2.24) holds for ueb-based, ceb-based, and syn-based 
estimators, namely 


$$Y_i^{est} = Y_{i_1}^{est} + Y_{i_2}^{est} \qquad (3.14)$$


for "est" = "ueb," "ceb," and "syn," given by (3.4), (3.6), and (3.8) respectively. Moreover
the disaggregation-aggregation property (2.27), namely


$$Y_i^{est} = Y_{i_1}^{est} + Y_{i_2}^{est}, \qquad (3.15)$$


where $i = i_1$ & $i_2$, and $F_{ji}^{est} = F_{ji_1}^{est} = F_{ji_2}^{est}$, holds for any estimator of $F_{ji}$, including those
based on ueb, ceb, and syn.
Write the risk of estimating $Y_i$ by $Y_i^{est}$ $(= \sum_{j=1}^{J} F_{ji}^{est} C_{ji})$ as


$$\text{est-risk}_i = E[(Y_i^{est} - Y_i)^2 f(C_i)]. \qquad (3.16)$$


The estimators given by "est" = "ueb," "ceb," and "syn," will be compared to "cen"
($F_{ji}^{cen} = 1$) via (3.16). For the rest of this section consider the estimator,


$$F_{ji}^{est} = r_j X_{ji} + (1 - r_j) X_{j\cdot}; \quad 0 \le r_j \le 1, \qquad (3.17)$$


a convex combination of the data $X_{ji}$ and the synthetic estimator $X_{j\cdot}$. Then


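$$\text{est-risk}_i = \sum_{j=1}^{J} \left[ (1 - r_j)^2 \tau_j^2 \left\{ C_{ji} - \frac{C_{ji}^2}{\sum_h C_{jh}} \right\} + \sigma_j^2 \left\{ r_j^2 C_{ji} + \frac{(1 - r_j^2) C_{ji}^2}{\sum_h C_{jh}} \right\} \right] f(C_i). \qquad (3.18)$$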


It is easy to see that the value of $r_j$ that minimizes (3.18) is $r_j = D_j = \tau_j^2/(\tau_j^2 + \sigma_j^2)$; i.e.,
neglecting the effect of estimating $\tau_j^2$ and $\sigma_j^2$, I obtain


$$\text{ueb-risk}_i \le \text{est-risk}_i; \quad 0 \le r_j \le 1. \qquad (3.19)$$
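
The minimization behind (3.19) can be checked numerically. The sketch below assumes that one stratum contributes $\tau^2 (1-r)^2 (C - C^2/C_\cdot) + \sigma^2 \{r^2 C + (1-r^2) C^2/C_\cdot\}$ to the risk, where $C$ is the stratum count in the area and $C_\cdot$ the stratum total over areas; this form is an assumption made for illustration, as are the function name and the census counts. Only the variance estimates are taken from (3.11).

```python
import numpy as np

def stratum_risk(r, tau2, sigma2, c_ji, c_j_total):
    """Assumed risk contribution of one stratum for shrinkage weight r."""
    share = c_ji / c_j_total
    stratum_part = tau2 * (1.0 - r) ** 2 * c_ji * (1.0 - share)
    sampling_part = sigma2 * c_ji * (r ** 2 + (1.0 - r ** 2) * share)
    return stratum_part + sampling_part

# Variance estimates for the black stratum, from (3.11).
tau2, sigma2 = 673.982, 522.183
# Hypothetical census counts: one area and the stratum total over all areas.
c_ji, c_j_total = 2.0e5, 2.6e7

# Grid search over the shrinkage weight r in [0, 1].
grid = np.linspace(0.0, 1.0, 100001)
risks = stratum_risk(grid, tau2, sigma2, c_ji, c_j_total)
r_best = grid[np.argmin(risks)]

# Closed-form minimizer D = tau^2 / (tau^2 + sigma^2).
d_j = tau2 / (tau2 + sigma2)
print(r_best, d_j)
```

Under these assumptions the grid minimizer agrees with $D_j$ to within the grid spacing, whatever the counts are, because the assumed risk is quadratic in $r$.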
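Now compare ueb-risk$_i$ (put $r_j = D_j$ in (3.17)) with cen-risk$_i$; recall from (2.29)


$$\text{cen-risk}_i = \sum_{j=1}^{J} \tau_j^2 C_{ji} f(C_i) + \left\{ \sum_{j=1}^{J} (F_j - 1) C_{ji} \right\}^2 f(C_i). \qquad (3.20)$$


Also, by putting $\tau_j^2 = k_j \sigma_j^2$; $j = 1, ..., J$,


$$\text{ueb-risk}_i = \sum_{j=1}^{J} \frac{\sigma_j^2}{1 + k_j} \left\{ k_j C_{ji} + \frac{C_{ji}^2}{\sum_h C_{jh}} \right\} f(C_i). \qquad (3.21)$$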

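A sufficient condition for ueb-risk$_i$ < cen-risk$_i$ is,


$$\frac{k_j}{1 + k_j} + \frac{C_{ji}}{\sum_h C_{jh}} \cdot \frac{1}{1 + k_j} < k_j; \quad j = 1, ..., J,$$

that is, if

$$\sigma_j^2/\tau_j^2 < \Big( \sum_h C_{jh} / C_{ji} \Big)^{1/2}; \quad j = 1, ..., J, \qquad (3.22)$$


then


$$\text{ueb-risk}_i < \text{cen-risk}_i. \qquad (3.23)$$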


Similarly, it can be shown that if


$$\sigma_j^2/\tau_j^2 \le 1; \quad j = 1, ..., J, \qquad (3.24)$$

then

$$\text{syn-risk}_i < \text{cen-risk}_i. \qquad (3.25)$$

Finally, if $(\sigma_j^2/\tau_j^2) < 1$, and
Ci 2Cji 
stot *( ~ ofr (1 +—— ) + 25a 4, 1S 7 (3.26) 
Ly Gin Gin 
h h 
then

$$\text{ceb-risk}_i < \text{syn-risk}_i. \qquad (3.27)$$


Once again (from (3.26)), if $\sigma_j^2/\tau_j^2$ is small, risks can be bounded.
Therefore an analogous sequence of inequalities to (2.32) is possible:

$$\text{ueb-risk}_i < \text{ceb-risk}_i < \text{syn-risk}_i < \text{cen-risk}_i, \qquad (3.28)$$


where the middle inequality requires the condition (3.26) and the last inequality requires the
condition (3.24). If either of these two inequalities does not hold, at least the ueb-based estimator
is an improvement over the census counts if condition (3.22) is satisfied. For the PEP 3-8 data
from the 1980 U.S. Census,


$$\hat\sigma_1^2/\hat\tau_1^2 = 0.77, \quad \hat\sigma_2^2/\hat\tau_2^2 = 0.80, \quad \hat\sigma_3^2/\hat\tau_3^2 = 1.00; \qquad (3.29)$$


that is, for the 1980 U.S. decennial census the census risk is larger than the synthetic risk and
the usual-empirical-Bayes risk is smallest of all.
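
As a rough illustration of how the ratios in (3.29) can be checked against the ordering conditions, the sketch below assumes that the synthetic condition requires each ratio to be at most 1 and that the ueb condition requires each ratio to be below the square root of the inverse census share of the area within its stratum. The function names and the census share are hypothetical; only the three ratios come from (3.29).

```python
import math

# Ratios of estimated sampling variance to estimated stratum variance, from (3.29).
ratios = {"blacks": 0.77, "nonblack hispanics": 0.80, "others": 1.00}

def syn_condition(ratio):
    """Assumed form of the synthetic condition: ratio at most 1."""
    return ratio <= 1.0

def ueb_condition(ratio, share):
    """Assumed form of the ueb condition: ratio < (stratum total / area count)^(1/2)."""
    return ratio < math.sqrt(1.0 / share)

# Hypothetical share of the stratum's census count that falls in area i.
share = 0.02

syn_ok = all(syn_condition(r) for r in ratios.values())
ueb_ok = all(ueb_condition(r, share) for r in ratios.values())
print(syn_ok, ueb_ok)
```

Because a census share never exceeds 1, the ueb bound is never below 1, so under these assumed forms any set of ratios that passes the synthetic check also passes the ueb check.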

Now compare the risk of using $Y_{i_1}^{est}$ and $Y_{i_2}^{est}$ (estimators of $Y_{i_1}$ and $Y_{i_2}$ respectively,
based on $F_{ji}^{est}$ given by (3.17)), with the risk of using $C_{i_1}$ and $C_{i_2}$, where area $i = i_1$ & $i_2$ is
disaggregated into two disjoint areas $i_1$ and $i_2$.


2 | oe" ay rE | 


t=1 


1 —- rj r? > 3 30 
aes epee pet Geka (3.30) 


Deen 


It is easy to see that under precisely the same conditions (3.22), (3.24), (3.26), the same sequence
of inequalities (3.28) holds; interpret est-risk$_i$ in (3.28) as being equal to (3.30) with $r_j = D_j$
for "est" = "ueb," with $r_j = D_j^{1/2}$ for "est" = "ceb," and with $r_j = 0$ for "est" = "syn".
Moreover for the loss function (2.15) with $f(C_i) = 1/C_i$, risk gaps widen as lower levels of
aggregation are attained.


4. DISCUSSION 


Various assumptions are made in deriving the risk inequalities (3.28), all of which deserve 
further investigation. The model (2.7) and (2.10) is assumed to fit, and in particular the inde- 
pendence of distributions between subareas is assumed. Moreover, the effect of estimating D; 
in the empirical Bayes estimators of F;; is assumed negligible. Notice however that syn- 
thetically estimated F;;’s do not use an estimate of D; and so those risk inequalities only rely 
on the appropriateness of the model (2.7), (2.10). 

The conditions which order the various risks and bound them below the census risk in (3.28)
all depend on $\sigma_j^2/\tau_j^2$ being "small." The practical implication is that a large number of
households need to be chosen in the post-enumeration survey (PES) or there can be no guarantee
that census counts can be improved by adjustment. With prior knowledge of stratum variation
(e.g., from a previous census), the PES could be designed so that the conditions are satisfied.
After the survey has been conducted and the data $\{X_{ji}: i = 1, ..., I; j = 1, ..., J\}$ are
available, the various conditions (3.22), (3.24), and (3.26) can all be checked by using the
estimators $\hat\tau_j^2$ and $\hat\sigma_j^2$ given by (3.2).


Concentrate on the best convex combination of $X_{ji}$ and $X_{j\cdot}$, namely $F_{ji}^{ueb}$ given by (3.3).
Then, ueb-risk$_i$ < cen-risk$_i$, if (3.22) holds; i.e., if


$$\sigma_j^2/\tau_j^2 < \Big( \sum_h C_{jh} / C_{ji} \Big)^{1/2}; \quad j = 1, ..., J. \qquad (4.1)$$


Notice that the condition is less stringent when the i-th area has a small census population;
conversely, areas of large census population may have a ueb-based estimated population
further from the truth than census. A sufficient condition for (4.1) to hold is $\sigma_j^2/\tau_j^2 \le 1$;
$j = 1, ..., J$, which is also the condition that guarantees the syn-based estimated population
improves over census. This condition was satisfied for the 1980 PEP 3-8 data (see
Section 3.2).


Finally, the condition (4.1) becomes less stringent at lower levels, and indeed the results of 
Section 3.2 show that the risk gap between the adjusted population and the census population 
widens. This deserves comment. The results are true provided the model holds at lower levels, 
but this is probably not the case at the block and the enumeration-district level. Presence of 
bias in (2.7) and (2.10); namely 


EF) = Fe Des EUG | fires, ae. (4.2) 


Ji» 


could cause a reversal in some of the risk inequalities. At the state level, however, Table 1 and
Cressie (1988) show through an examination of residuals that (2.7) and (2.10) do fit the
1980 PEP 3-8 data. And since (3.29) implies that condition (4.1) is satisfied, one can be
confident that ueb-based adjusted state totals are closer to the truth than census state totals. That
may not be true at the block level; clearly a decision needs to be made regarding the level at
which it is most important to have accurate census counts. The first use of U.S. Census data
is the reporting of state totals to Congress for the purpose of redistricting House seats. One
might include a number of large cities in with the states, and create, e.g., the "states"
New York City, and New York State Except New York City. It seems to me that this "state"
level is the most sensitive politically and that accurate totals at this level should receive the
highest priority.
ABSTRACT


To estimate census undercount, a post-enumeration survey (PES) is taken, and an attempt is made to 
find a matching census record for each individual in the PES; the rate of successful matching provides 
an estimate of census coverage. Undercount estimation is performed within poststrata defined by 
geographic, demographic, and housing characteristics, X. Portions of X are missing for some individuals 
due to survey nonresponse; moreover, a match status Y cannot be determined for all individuals. A pro- 
cedure is needed for imputing the missing values of X and Y. This paper reviews the imputation methods 
used in the 1986 Test of Adjustment Related Operations (Schenker 1988) and proposes two alternative 
model-based methods: (1) a maximum-likelihood contingency-table estimation procedure that ignores 
the missing-data mechanism; and (2) a new Bayesian contingency table estimation procedure that does 
not ignore the missing-data mechanism. The first method is computationally simpler, but the second is 
preferred on conceptual and scientific grounds. 


KEY WORDS: Bayesian methods; Categorical data; Coverage error; EM algorithm; Multiple imputa- 
tion; Nonignorable nonresponse; Undercount. 


1. INTRODUCTION 


The U.S. Bureau of the Census has used a post-enumeration survey (PES) to evaluate cov- 
erage error in several past censuses, and it plans to conduct a PES after the 1990 Decennial 
Census as well. For each individual in the PES, an attempt is made to find a census record 
(i.e., a match) to determine whether the person was enumerated in the census. The proportion
of PES persons who were missed in the census is used as an estimate of the proportion of persons 
in the population who were missed. A similar matching operation is performed to match a 
sample of individuals from the census to the PES; this provides an estimate of the census over- 
count resulting from erroneous (e.g., duplicate or fictitious) enumerations. 

The data on matches and erroneous enumerations obtained from the PES are combined 
to estimate the population size via the dual-system estimator; this capture-recapture type of 
estimator is discussed in Marks, Seltzer and Krotki (1974), Krotki (1978), Wolter (1986), Dif- 
fendal (1988), and Fay, Passell and Robinson (1988, Chapter 5). Dual-system estimates of 
population size are computed within poststrata defined by geographic, demographic (age, sex, 
race), and housing (owner/renter, type of housing structure) characteristics. 

Two problems of missing data occur in the PES and complicate the estimation process: 
1. Geographic, demographic, or housing characteristics may be missing for a person, so it is
not known to which poststratum that person belongs.

2. After the processing of the PES, there are some individuals with match status (dichotomous 
variable indicating matched/not matched to census) or erroneous enumeration status 
missing. This can occur, for instance, when an incomplete name is obtained in the PES, 
or when there is difficulty in specifying a Census Day address for someone who moved 
between Census Day and the PES. 


! Donald B. Rubin and Joseph L. Schafer, Department of Statistics, Harvard University, Cambridge, MA 02138, 
Missing data were a major source of uncertainty in undercount estimation for the 1980
Decennial Census (Freedman and Navidi 1986; Fay, Passell and Robinson 1988, Chapter 6). 
Improvements in the PES design should reduce the amount of missing data in 1990 (Hogan 
and Wolter 1988), but a method for dealing with missing data will still be necessary. 

The 1986 Test of Adjustment Related Operations (TARO), a recent test of undercount 
estimation and adjustment (Diffendal 1988; Schenker 1988), used a PES that was similar in 
design to that planned for 1990. This paper reviews the methods used to handle missing data 
in TARO (Schenker 1988), identifies potential weaknesses of these methods, and discusses 
potential alternatives. 

Our goal is to indicate issues and problems, and to suggest methods for their solution. The 
long range plan for research is to carefully evaluate these methods. Although we only discuss 
imputation for missing PES data when estimating undercount, missing data also occur in the 
census sample used to estimate overcount. The missing-data problems in estimating overcount, 
however, are analogous to those in estimating undercount (Schenker 1988), and so our discus- 
sion applies to both problems. 

In our discussion of alternatives to the TARO procedures, we propose a new method based 
on a Bayesian model that does not ignore the missing-data mechanism, and thus does not assume 
that the missing data are missing at random. Nonignorable models for incomplete categorical 
data are a recent development in the theory of handling missing data; see Fay (1986), Little 
and Rubin (1987, Section 11.6), and Baker and Laird (1988) for discussions and reviews of 
the literature. Moreover, the types of missing data that we discuss occur not only in under- 
count estimation, but in many other situations as well; thus our discussion is relevant to the 
general problem of handling missing categorical data. 

Section 2 discusses the imputation methods used in TARO. In Section 3, alternative methods 
are described and illustrated using a simple example. Section 4 presents a concluding discussion. 


2. IMPUTATION METHODS USED IN TARO 


2.1 Description of Methods 


For each individual in the PES, let X denote categorical variables for age, sex, race, 
owner/renter status, and type of housing structure; let Y denote match status (1 = match, 
0 = nonmatch); and let Z denote variables indicating whether the PES interview was with a 
household member or a proxy, and whether the PES person moved between Census Day and 
the PES. In TARO, the X variables (except type of housing structure) were used in forming 
poststrata (Diffendal 1988); Z was observed for all PES individuals, but Y and components 
of X were sometimes missing (Schenker 1988). 


Missing values of X and Y were imputed in two stages. (Our description is simplified for
ease of presentation; see Schenker (1988) for the precise procedure.) First, all missing X values
were imputed using a "hot deck" scheme based on observed X variables; that is, imputed values
were drawn from the observed distributions of X values. Second, after the missing values of
X were filled in, a logistic regression model predicting Y from X and Z was fitted to the cases 
with Y observed. This logistic regression model was then used to impute probabilities of match 
for all missing Y values. Probabilities rather than zeros and ones were imputed to (a) increase 
the precision of estimation, and (b) allow the assessment of variability due to imputation 
(Schenker 1989). 
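
The two-stage scheme can be sketched on toy data as follows. This is a simplified illustration, not the TARO procedure: it assumes a single binary X and a single binary Z, uses an unconditional hot deck (draws from all observed X values), and fits the logistic regression by Newton-Raphson. All names, sample sizes, and coefficients are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def hot_deck(x, rng):
    """Fill missing entries (NaN) with random draws from the observed values."""
    x = x.copy()
    missing = np.isnan(x)
    x[missing] = rng.choice(x[~missing], size=missing.sum())
    return x

def fit_logistic(design, y, n_iter=25):
    """Fit a logistic regression by Newton-Raphson; returns the coefficients."""
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        gradient = design.T @ (y - p)
        hessian = design.T @ (design * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

# Toy PES-like data: X and Z binary, Y = match status.
n = 500
x = rng.integers(0, 2, n).astype(float)
z = rng.integers(0, 2, n).astype(float)
p_true = 1.0 / (1.0 + np.exp(-(1.0 + 0.8 * x - 0.6 * z)))
y = (rng.random(n) < p_true).astype(float)
x[rng.random(n) < 0.1] = np.nan   # some X values missing
y[rng.random(n) < 0.1] = np.nan   # some match statuses missing

# Stage 1: hot-deck imputation of X from observed X values.
x_filled = hot_deck(x, rng)

# Stage 2: logistic regression of Y on X and Z, fitted to cases with Y observed.
design = np.column_stack([np.ones(n), x_filled, z])
observed = ~np.isnan(y)
beta = fit_logistic(design[observed], y[observed])

# Impute match probabilities (not zeros and ones) for the missing Y values.
y_imputed = y.copy()
y_imputed[~observed] = 1.0 / (1.0 + np.exp(-design[~observed] @ beta))
```

The imputed Y values are fractional, which mirrors the choice described above of imputing probabilities rather than zeros and ones.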
2.2 Critique of Methods


The TARO imputation methods have many positive features. They are easily understood 
and use explicit modeling for the imputation of Y. They also condition on much of the observed 
data, rather than imputing from marginal distributions. Finally, in principle they allow the 
assessment of uncertainty in undercount estimates due to the missing Y values. The methods 
have some potential weaknesses, however, which we now describe. 

The TARO imputation procedure is an ‘‘ignorable’’ procedure, because it ignores the 
missing-data mechanism. Ignorable procedures assume that the missing data are missing at 
random (MAR) (Rubin 1976); that is, they assume that given the observed data, the missingness 
is independent of the values of the missing items. For example, if X and Z are observed for 
all people, MAR implies that Y can be imputed using the conditional distribution of Y given 
X and Z for those individuals having X, Y, and Z observed. 

The TARO procedure is actually a special case of an ignorable procedure, because it makes 
assumptions that are stronger than the general MAR assumption. The TARO procedure treated 
X and Y asymmetrically; that is, it imputed missing values of Y conditional on all observed 
data, but it imputed missing X values conditional only on the observed X's, rather than on
the observed values of X, Y, and Z. Hence, in addition to the general MAR assumption, the 
TARO procedure also effectively assumed that, given the observed components of X, the 
missing components of X are conditionally independent of both Y and Z. 

This additional independence assumption may not be realistic; it may be that given the 
observed X data, there is a residual dependence of values of missing components of X on Y 
and/or Z. If this is the case, then observed values of Y and Z should be used in the imputation
of X. For instance, suppose a PES individual has sex missing, but is found not to match any 
census record (Y = 0) on the basis of observed age, race, and address; and suppose males tend 
to be undercounted in the census more than females with identical other characteristics. Then 
knowing that Y = 0 provides some evidence that the person in question is more likely to be 
male than if Y were 1. The most general ignorable imputation procedure would use informa- 
tion provided by Y and Z in imputing missing X values; this is one of the alternative imputa- 
tion methods, which we discuss in Section 3.4.1. 

Another feature of the TARO procedure that may be unrealistic is the ignorability assump- 
tion itself. It may be that the missing data are not MAR — i.e., given the observed data, the 
missingness is not independent of the values of the missing items; if so, then it would be more 
appropriate to use a nonignorable model for the missing-data mechanism. For instance, con- 
sider a group of people with identical values of all variables except race; it may be more diffi- 
cult to obtain information on race for minorities than nonminorities, and consequently the 
distribution of race will be different among those missing race and those with race observed. 
Similarly, even after all X and Z variables are controlled for, it may be that people who were 
not enumerated in the census are more likely to be missing Y than those who were enumerated 
in the census. An alternative imputation method based on a general class of nonignorable 
models is presented in Section 3.4.2. 


3. ALTERNATIVE METHODS OF IMPUTATION IN THE PES 


3.1 Introduction 


Let $X = (X_1, X_2, X_3)$ denote three individual characteristics recorded by the PES (e.g., age,
sex, and race). The variables $X_1$, $X_2$, and $X_3$ are assumed to be categorical, taking I, J, and
K possible values respectively. We have chosen three variables merely for illustrative purposes
and notational simplicity; all ideas developed here will extend immediately to any number of
categorical variables. In practice, these X variables will probably include the demographic, 
geographic, and housing characteristics used to define poststrata for undercount estimation; 
they may also include additional PES variables, such as mover status and household 
member/proxy status, which are not of intrinsic interest but which may be useful for imputa- 
tion purposes. 

We will form IJK different classes of individuals by cross-classifying them according to $X_1$,
$X_2$, and $X_3$. These classes may or may not be the same as the poststrata for undercount
estimation; in practice the poststrata will probably be coarser than these classes. It is convenient, but
not necessary, for these classes to be defined as cross-classifications of all possible values of
$X_1$, $X_2$, and $X_3$; more complicated patterns (such as nested ones) are also possible. We will
be constructing loglinear models for cross-classified contingency tables, but loglinear models 
may be based on other patterns as well. 

Let Y be the dichotomous variable denoting match status, taking values 1 (matched to census)
or 0 (not matched). If there were no missing data, the results of the PES could be summarized
in a single four-dimensional contingency table with $I \times J \times K \times 2$ cells, since each individual
could be fully classified according to $X_1$, $X_2$, $X_3$, and Y. But those individuals missing one or
more variables can be only partially classified according to those variables that are observed.
Those having $X_1$, $X_2$, $X_3$, and Y all observed will constitute a four-dimensional table, which
we will call the table of complete cases (CC), or the data table for missingness pattern 1 (no
variables missing). Those having $X_1$, $X_2$, and $X_3$ observed but Y missing will constitute a
three-dimensional supplementary table with IJK cells, which we will call the data table for missingness
pattern 2. In general, there will be $2^4$ such tables corresponding to all possible missingness
patterns, one CC table and $2^4 - 1$ supplementary tables.


3.2 Imputation from Reference Tables 


In our model-based approach to imputation, we will model the data tables for different
missingness patterns as multinomial observations. Corresponding to each missingness pattern, we
will define a set of cell probabilities $\theta^t = \{\theta^t_{ijkl}\}$, where the superscript t indexes the
missingness pattern, $t = 1, ..., 2^4$, and the subscripts i, j, k, and l indicate the levels of $X_1$, $X_2$,
$X_3$, and Y respectively. Because we will refer to $\theta^t$ when imputing missing values for the t-th
data table, we will call $\theta^t$ the reference table for the t-th data table, and $\{\theta^t: t = 1, ..., 2^4\}$
the set of reference tables.

Imputation of missing values corresponds to expanding each supplementary data table to
make it fully four-dimensional, according to its corresponding reference table. For example,
consider the imputation of Y for those individuals missing only Y. This is equivalent to
expanding the supplementary data table for missingness pattern 2, by dividing each cell count in this
table into two parts, a count of those having Y = 1 and a count of those having Y = 0, split
according to the reference table $\theta^2$. With known $\theta^2$ this procedure is straightforward: we first
obtain from $\theta^2$ the conditional distribution of Y given X for this missingness pattern, i.e.,


$$P(Y = 1 \mid X_1, X_2, X_3, t = 2) = \frac{\theta^2_{ijk1}}{\theta^2_{ijk0} + \theta^2_{ijk1}}, \qquad (1)$$


for i = 1, ..., I, j = 1, ..., J, and k = 1, ..., K. Then, we impute Y = 1 for each observation
in cell ijk of this table with probability given by the right-hand side of (1); alternatively, we could
impute the mean of this distribution, which is just the probability of a match (1). The relative
merits of random draw versus mean imputation for the PES will be discussed in Section 3.3.
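
The expansion of the supplementary table for missingness pattern 2 can be sketched as follows. The reference table and the cell counts are randomly generated placeholders, and the variable names are hypothetical; the sketch only illustrates the two options (random draw and mean imputation) described above.

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, K = 2, 2, 3

# Hypothetical reference table for pattern 2: cell probabilities indexed by (i, j, k, l),
# where l = 0 (nonmatch) or 1 (match).
theta2 = rng.dirichlet(np.ones(I * J * K * 2)).reshape(I, J, K, 2)

# Conditional probability of a match given X, as in equation (1).
p_match = theta2[..., 1] / theta2.sum(axis=-1)

# Hypothetical supplementary table: counts of individuals with X observed and Y missing.
counts = rng.integers(0, 20, size=(I, J, K))

# Random-draw imputation: split each cell count into matches and nonmatches.
matched = rng.binomial(counts, p_match)
expanded_draw = np.stack([counts - matched, matched], axis=-1)

# Mean imputation: allocate fractional counts according to the match probability.
expanded_mean = np.stack([counts * (1.0 - p_match), counts * p_match], axis=-1)
```

Both expanded tables have the same X margins as the original supplementary table; they differ only in whether the split on Y is an integer draw or an expected (fractional) count.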
Note that in the example above, the only information from $\theta^2$ needed for the imputation
is the conditional distribution of Y given X; hence, any value of $\theta^2$ yielding the same values
for (1) leads to the same imputation procedure. For an imputation procedure to be accurate,
then, our estimate of $\theta^t$ need not correspond to the joint distribution of Y and X for the t-th
missingness pattern; the only requirement is that the conditional distribution of the missing
variables given the observed ones derived from our estimate of $\theta^t$ be close to the correct one.

In particular, if the missing-data mechanism is ignorable, one common reference table
$\theta^t = \Theta$, $t = 1, ..., 2^4$, provides valid imputations for all missingness patterns, even though
the joint distribution of X and Y might vary across missingness patterns. The fact that only
one reference table is needed follows from the definition of ignorability, which implies that
the conditional distribution of missing values given observed values does not depend on the
missingness pattern. The value $\Theta$ that provides valid imputations is not $\Theta_{CC}$, the cell
probabilities for the joint distribution of $X_1$, $X_2$, $X_3$, and Y underlying the CC table; rather, it is
the joint distribution of $X_1$, $X_2$, $X_3$, and Y marginalized across missingness patterns.
Generally, if the missing-data mechanism is nonignorable, we will need to specify a different
reference table for each missingness pattern.

In our model-based approach, the two crucial issues to be addressed are: (1) how to estimate 
the set of reference tables using well-established principles of efficient estimation; and (2) how 
to perform the imputation once these estimates are obtained. Two methods of estimation will 
be compared in Section 3.4; in Section 3.3 we briefly discuss various alternatives for imputation. 


3.3 Single, Multiple, and Mean Imputation 


Once the reference tables have been estimated, distributions for each individual’s missing 
variables given the observed ones have been completely specified. In theory, these distribu- 
tions could be used to analytically calculate correct point and interval estimates for any quan- 
tities of interest. In practice, however, these calculations are usually intractable; some other 
procedure is needed. Filling in the missing values by imputation is an attractive alternative, 
because it creates a completed dataset, which can be analyzed by complete-data methods. Little 
(1986) summarizes the strengths and weaknesses of various imputation methods; we shall only 
comment on aspects relevant to the PES. 

In current practice, each missing value is typically filled in by taking a single random draw 
from a distribution, thereby producing a simulated complete dataset, which is analyzed in the 
usual complete-data fashion. Interval estimates derived from this method will be artificially 
too precise, because they do not reflect the uncertainties of the imputation. One remedy for 
this, which is coming into use, is multiple imputation (Rubin 1987), in which each missing value 
is replaced by m random draws from the distribution. With moderate amounts of missing infor- 
mation, m = 5 draws are enough to produce efficient point estimates and adequate interval 
estimates. With rates of missing information that appear likely in the PES (typically 5 - 10 
percent or less, judging from TARO), m = 2 draws will be perfectly adequate for essentially 
all purposes. In a large-scale survey like the PES, however, even a small number of multiple 
imputations may be computationally difficult to handle. 
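
For reference, the usual way of combining the m completed-data analyses (the combining rules associated with multiple imputation in Rubin 1987) can be sketched as below. The function name and the numerical inputs are hypothetical; the example uses m = 2 match-rate estimates with their completed-data variances.

```python
import numpy as np

def combine_multiple_imputations(estimates, variances):
    """Combine m completed-data estimates and variances into one estimate and total variance."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()                       # combined point estimate
    within = u.mean()                      # average within-imputation variance
    between = q.var(ddof=1)                # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, total

# Hypothetical match-rate estimates from m = 2 imputed datasets.
q_bar, total_var = combine_multiple_imputations([0.942, 0.946], [1.0e-5, 1.2e-5])
print(q_bar, total_var)
```

The between-imputation term is what a single random-draw imputation omits, which is why interval estimates from a single imputation tend to be too narrow.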

Since the estimates of interest in the PES are the match rates within poststrata, it is
probably more important to accurately reflect the variability of imputation for Y than for X; that
is, it is probably more important to reflect uncertainty in overall undercount rates than
uncertainty in the allocation of undercount to poststrata. Thus it may be possible to obtain
adequate results by imputing a single set of X values, and then multiply imputing Y given X. Yet
another possibility is to impute a single set of X values, and then impute the probability of 
match given X. This approach was used in TARO (Schenker 1988); it allows the imputed X’s 
and fractional Y’s to be treated like single imputations when estimating undercount rates. 
Choosing an acceptable imputation procedure given a set of reference tables is the subject
of ongoing research. It is hoped that the TARO approach of imputing a single value of X and 
then imputing P(Y = 1 | X) will prove to be a useful compromise between the accuracy of 
multiple imputation and the computational ease of single imputation. 


3.4 Models and Methods of Estimation 


In this section, we present two alternative procedures for modeling the missing data and 
estimating the reference tables for imputation. The two procedures are the Ignorable Maximum- 
Likelihood (IML) method and a new Nonignorable Bayesian (NB) method that should be an 
improvement over IML if the missing data are not MAR. 


3.4.1 The Ignorable Maximum-Likelihood Method 


As mentioned previously, an ignorable imputation procedure needs to specify only a single
reference table and apply it to all missingness patterns. One naive approach is to estimate this
common reference table $\Theta$ by the cell proportions observed in the CC table. The resulting
estimate $\hat\Theta_{CC}$ is asymptotically unbiased for $\Theta$ if the missing data are missing completely at
random (MCAR), that is, if the probability of missingness for each item is completely
independent of the data values, observed or missing. If the missing data are merely MAR, and not
MCAR, then using $\hat\Theta_{CC}$ for imputation introduces biases into the data. Moreover, even when
the data are MCAR, $\hat\Theta_{CC}$ is not efficient because it does not make use of all of the observed
data to estimate $\Theta$.

The IML method makes use of all the data, both in the CC table and in the supplementary
tables, to estimate $\Theta$. The estimated value $\hat\Theta_{IML}$ is chosen to maximize the likelihood ignoring
the missing-data mechanism (Little and Rubin 1987, Section 5.3). In general, there is no closed
form expression for $\hat\Theta_{IML}$; it must be obtained iteratively, for instance via the EM algorithm
(Dempster, Laird and Rubin 1977; Little and Rubin 1987, Section 9.3).

The EM algorithm for contingency tables is easy to implement, and the resulting maximum
likelihood estimate $\hat\Theta_{IML}$ is both efficient and consistent under the assumption of ignorability;
thus this EM procedure for IML is attractive from both computational and theoretical
perspectives. When the missing data are not MAR, however, the IML method will generally introduce
biases. Since there are good reasons to believe that the missing data in the PES are not missing
at random, we propose a new method of estimation that makes a different assumption.
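
To illustrate the kind of EM iteration involved, the sketch below treats a reduced case: a two-way table with one fully classified (CC) table and two supplementary tables, one classified by the row variable only and one by the column variable only. This is a minimal sketch under an ignorable model with a saturated multinomial; the function name and the counts are hypothetical, and the PES setting would involve more variables and more missingness patterns.

```python
import numpy as np

def em_two_way(n_cc, n_row_only, n_col_only, n_iter=200):
    """EM estimate of cell probabilities for a two-way table with supplementary margins."""
    theta = np.full(n_cc.shape, 1.0 / n_cc.size)
    total = n_cc.sum() + n_row_only.sum() + n_col_only.sum()
    for _ in range(n_iter):
        # E-step: allocate partially classified counts using current conditional probabilities.
        col_given_row = theta / theta.sum(axis=1, keepdims=True)
        row_given_col = theta / theta.sum(axis=0, keepdims=True)
        expected = (n_cc
                    + n_row_only[:, None] * col_given_row
                    + n_col_only[None, :] * row_given_col)
        # M-step: re-estimate cell probabilities from the expected complete table.
        theta = expected / total
    return theta

# Hypothetical counts.
n_cc = np.array([[100.0, 50.0], [75.0, 75.0]])   # both variables observed
n_row_only = np.array([30.0, 60.0])              # column variable missing
n_col_only = np.array([28.0, 60.0])              # row variable missing

theta_iml = em_two_way(n_cc, n_row_only, n_col_only)
print(theta_iml)
```

Unlike the estimate based on the CC table alone, this estimate uses the supplementary margins as well, which is the source of the efficiency gain described above.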


3.4.2 Nonignorable Modeling and Nonuniqueness of the MLE 


When the missing data are not MAR, it is no longer valid to ignore the missing-data
mechanism; the fact that a data value is missing conveys information about its value. Hence, a model
that reflects this dependence must include indicator variables for response, indicating whether
data values were observed or missing. Consequently, a nonignorable model will generally
estimate a separate reference table for each missingness pattern, or equivalently, an expanded
reference table $\Theta$ with twice as many dimensions (i.e., with an additional dimension for each
missingness indicator).

Let $R = (R_1, R_2, R_3, R_4)$ be indicator variables for whether $X_1$, $X_2$, $X_3$, and Y are
observed, respectively; for example, $R_1 = 1$ if $X_1$ is observed and $R_1 = 0$ if $X_1$ is missing.
Consider the eight-dimensional contingency table formed by cross-classifying individuals
by X, Y, and R, and now let $\Theta$ be the eight-dimensional table of cell probabilities for this
expanded table.
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Each individual in the survey belongs to a cell of the expanded table, but because some data 
are missing, we only observe certain margins of this table. Because R is fully observed, any 
margin involving only missingness indicators is fully observed, but a margin involving Y or 
one of the X’s might not be observed. For example, in the cross-section of the table with 
R, = R, = R; = 1 and R, = 0, we can classify individuals by X), X>, and X3, but not by 
Y; therefore we observe only the marginal totals obtained by summing across Y. 

The number of parameters in the fully saturated model for this table is $2^5 IJK - 1$, which 
is larger than the number of observed sufficient statistics; hence the maximum-likelihood 
estimate (MLE) for $\theta$ is not uniquely determined. In order to obtain a unique estimate for $\theta$, 
one must impose additional structure. 
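The counting argument above can be checked with a short sketch. It assumes that I, J, and K denote the numbers of categories of the three X variables, that Y is dichotomous, and that the "observed sufficient statistics" are the free cell counts of the margins seen under each of the 16 missingness patterns; the function name is illustrative only.

```python
from itertools import product


def count_parameters_and_statistics(levels):
    """Return (saturated-model parameters, free observed margin counts).

    `levels` lists the number of categories of each substantive variable,
    e.g. [I, J, K, 2] for X1, X2, X3 and a dichotomous Y (an assumption here).
    """
    n_vars = len(levels)

    # Expanded table: every substantive cell crossed with 2**n_vars patterns.
    full_cells = 2 ** n_vars
    for n in levels:
        full_cells *= n

    # For each missingness pattern only the margin over observed variables is seen.
    observed_cells = 0
    for pattern in product([0, 1], repeat=n_vars):
        margin_cells = 1
        for observed, n in zip(pattern, levels):
            if observed:
                margin_cells *= n
        observed_cells += margin_cells

    # Probabilities sum to one, and counts sum to the sample size.
    return full_cells - 1, observed_cells - 1


if __name__ == "__main__":
    params, stats = count_parameters_and_statistics([2, 2, 2, 2])
    print(params, stats)  # 255 parameters versus 80 free observed counts
```

With all variables dichotomous the saturated model has 255 parameters against 80 free observed counts, so the gap that must be closed by additional structure is already large in the smallest case.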

One possible way to obtain a unique MLE is to build a log-linear model for the expanded 
contingency table, with some of the higher-order interactions set equal to zero (Little 
1985; Fay 1986; Little and Rubin 1987, Section 11.6). We might try to set to zero those 
interactions that are not estimable from the data, but the formalization of this does not 
always work well in practice. For example, it may at first appear that the $R_1$ by $X_1$ 
interaction is not estimable, because the value of $X_1$ is never observed when $R_1 = 0$; however, 
the data may contain information about the $R_1$ by $X_1$ interaction indirectly through another 
variable, one that is observed for some individuals having $R_1 = 1$ and some having $R_1 = 0$. 
An example of a quantity that is truly inestimable from the data is 
$P(Y = 1 \mid X_1 = i, X_2 = j, X_3 = k, R_1 = R_2 = R_3 = 1, R_4 = 0)$, but this does not correspond to any single 
interaction term in the log-linear model parameterization. (By “truly inestimable” we mean 
in Rubin’s (1974) sense that the parameter’s posterior distribution equals its prior 
distribution for all priors.) 

In a dataset with a complicated pattern of missingness, it is not easy to find a set of 
log-linear terms that, if set to zero, will yield a unique MLE for $\theta$. The minimum number of terms 
that must be set to zero to produce uniqueness is $2^5 IJK - 1$, the dimension of $\theta$, minus the 
number of observed sufficient statistics. Even if such a minimal set can be found, it is usually 
not unique, and one is faced with the task of deciding which set of terms should be excluded 
from the model. Rather than attempting to obtain a unique MLE by placing these kinds of 
prior restrictions on the log-linear model, we will instead use a Bayesian approach involving 
the use of a prior distribution. 


3.4.3 A Nonignorable Bayesian Method 


In the Bayesian paradigm, one expresses prior assumptions about the parameters formally 
through a prior distribution. For our situation, a proper unimodal prior, when combined with 
the observed-data likelihood, produces a posterior distribution for $\theta$ that can yield a unique 
estimate; for example, we may take the posterior mode, $\hat{\theta}_{NB}$, as our estimate of $\theta$. This 
method is attractive because it automatically allows precise estimation of those functions of 
$\theta$ about which the data contain much information, while using the prior to select appropriate 
values for those quantities that are strictly inestimable from the data. If applied properly, this 
method will produce a nonignorable model that fits the data as well as any other model (it 
essentially maximizes the likelihood function) and yet is as consistent as possible with our beliefs 
about the nature of the missing-data mechanism as expressed in the prior distribution. 

Sound scientific practice suggests that we should choose a prior distribution that favors 
simple structure (i.e., small higher-order interactions) over complicated structure (i.e., large 
higher-order interactions). If we choose a prior that assigns a low (but nonzero) a priori prob- 
ability to the presence of higher-order interactions in the log-linear model, then we will be 
making assumptions that are similar in nature to the assumptions of the IML method, namely that 
missing values are not radically different from their observed counterparts in their 
relationships with other observed variables, although in a smoother, more systematic fashion than 
the IML method does. 

Following the notation of Bishop, Fienberg, and Holland (1975), consider the saturated log- 
linear model for the eight-way contingency table for R, X, and Y, 


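$$
\log \theta_{ijk \ldots p} = \mu_0 + \mu_{1(i)} + \mu_{2(j)} + \cdots + \mu_{8(p)}
    + \mu_{12(ij)} + \mu_{13(ik)} + \cdots
    + \mu_{123 \ldots 8(ijk \ldots p)}, \qquad (2)
$$


where $\theta_{ijk \ldots p}$ is the probability that an observation falls in cell $ijk \ldots p$, and the $\mu$'s are the 
one-way, two-way, three-way, and higher-order interactions. We propose the simple family of 
independent normal prior distributions 


$$
\begin{aligned}
\mu_{i} &\sim N(0, \sigma^2) \\
\mu_{ij} &\sim N(0, \sigma^2/\tau) \\
\mu_{ijk} &\sim N(0, \sigma^2/\tau^2) \\
\mu_{ijkl} &\sim N(0, \sigma^2/\tau^3)
\end{aligned}
\qquad (3)
$$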


for some choice of $\sigma^2 > 0$ and $\tau > 1$. This prior distribution pulls the higher-order 
interactions toward zero, and hence pulls the estimate of $\theta$ toward a more parsimonious or simpler 
model. We believe that this approach will produce estimates of $\theta$ that are not too different 
from $\hat{\theta}_{IML}$ when the missing data are truly MAR, but will be more robust than the IML 
method under departures from MAR. The only cases when IML will be superior occur when 
the missing data are MAR and strong higher-order interactions exist among the X’s and Y. 
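As a sketch of what the prior in (3) implies, the log posterior can be written as a penalized observed-data log-likelihood. Here $\ell_{\text{obs}}$ (the log-likelihood of the observed margins) and $T_m$ (the set of free $m$-way interaction terms) are notation introduced for illustration only, and identifiability constraints on the interaction terms are ignored:

```latex
\log p(\theta \mid \text{data})
  = \ell_{\text{obs}}(\theta)
  - \frac{1}{2\sigma^{2}} \sum_{m=1}^{8} \tau^{\,m-1} \sum_{t \in T_m} \mu_t^{2}
  + \text{const}.
```

Because an $m$-way term has prior variance $\sigma^2/\tau^{m-1}$, its squared value is penalized with weight $\tau^{m-1}/(2\sigma^2)$, so each additional order of interaction is shrunk toward zero $\tau$ times more strongly than the order below it.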

Leonard (1975) and Laird (1978) examined log-linear models with normal prior 
distributions on the $\mu$ terms for complete data; our situation is complicated by the fact that only 
certain margins of the eight-way table are observed. Finding the posterior mode $\hat{\theta}_{NB}$ under this 
model is conceptually straightforward; the EM algorithm can be applied to the posterior 
distribution of $\theta$, just as to the likelihood function. The E-step remains the same; the M-step, 
however, poses some computational difficulties. The posterior distribution is nearly a ridge 
in high-dimensional space; it is very steep in certain directions, but nearly flat in others. The 
second-derivative matrix is nearly singular along this ridge; hence Newton-Raphson and other 
gradient methods for maximization will not work well. Difficulty arises as $\sigma^2$ becomes large, 
because the ridge becomes flat as $\sigma^2 \to \infty$ and a unique mode no longer exists. Difficulty also 
arises as the number of observations grows, because the posterior becomes very steep in 
certain directions and thus portions of the second-derivative matrix become very large. More work 
is needed to develop effective methods for finding or approximating $\hat{\theta}_{NB}$. 


3.4.4 A Numerical Example 


We now present a simple numerical example and compare the results obtained from the IML 
and NB methods. For simplicity, we will only use a single dichotomous X variable (taking values 
0 or 1) and match status Y. 

If there were no missingness, the data could be fully cross-classified by X and Y and hence 
summarized in a single 2 × 2 contingency table. With four patterns of missingness, however, 
the data are summarized in a CC table and three supplementary tables (Figure 1). 

The CC estimate $\hat{\theta}_{CC}$ is simply the observed proportions in Table A. The IML estimate 
$\hat{\theta}_{IML}$ is found iteratively via the EM algorithm; using $\hat{\theta}_{CC}$ as the starting value, the algorithm 
converges in approximately four cycles. The NB estimate $\hat{\theta}_{NB}$ was found using a prior 
distribution with $\sigma^2 = 10$ and $\tau = 3$. This means that the one-way terms are a priori normally 
distributed about zero with variance 10, so there is a 95 percent probability that the log-odds 
for each main effect lies inside the interval $(-4\sqrt{10}, +4\sqrt{10})$. The two-way terms have 
variance 10/3, the three-ways have variance 10/9, and the four-ways have variance 10/27; this 
represents a moderate pulling of the higher-order terms toward the origin. (Finding $\hat{\theta}_{NB}$ for 
varying values of $\sigma^2$ and $\tau$ proved difficult, because of the numerical instability of the 
particular maximization routine applied at each M-step.) The values of $\hat{\theta}_{IML}$ and $\hat{\theta}_{NB}$ are given 
in Figure 2. The expected imputations under these models are given in Figure 3, along with 
the expected imputations under $\hat{\theta}_{CC}$ for comparison. 
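The prior variances quoted for this example can be reproduced with a few lines of arithmetic. The sketch below assumes that a k-way term has variance equal to the one-way variance divided by the shrinkage ratio raised to the power k − 1, and, for the interval check, that a dichotomous main effect is coded as plus or minus the same term so that the log-odds is twice the term; that coding is an assumption made here and is not stated in the text.

```python
from fractions import Fraction
import math


def prior_variance(order, sigma2, tau):
    """Prior variance of an interaction term of the given order (1 = one-way)."""
    return Fraction(sigma2) / Fraction(tau) ** (order - 1)


# Values used in the numerical example: one-way variance 10, shrinkage ratio 3.
variances = [prior_variance(order, 10, 3) for order in range(1, 5)]
for order, var in enumerate(variances, start=1):
    print(f"{order}-way terms: variance {var}")

# Assumed effect coding: log-odds = 2 * (one-way term), so its prior standard
# deviation is 2 * sqrt(10) and a 95% interval has half-width 1.96 * 2 * sqrt(10).
half_width = 1.96 * 2 * math.sqrt(float(variances[0]))
print(round(half_width, 2), round(4 * math.sqrt(10), 2))
```

Under that coding the 95 percent half-width is close to four times the square root of 10, which agrees with the interval given for the main effects.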

The differences between the imputation methods can be seen most clearly by comparing 
the expected imputations for Table D. Imputation using $\hat{\theta}_{CC}$ simply reproduces the 
proportions observed in Table A. Imputation using $\hat{\theta}_{IML}$ differs from imputation using $\hat{\theta}_{CC}$ because 
Tables B and C, as well as Table A, contribute to the estimation of $\theta$ and hence to the 
imputation for Table D. 
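A minimal sketch of the ignorable (IML) calculation for a 2 × 2 table is given below. It starts from the complete-case proportions, spreads the partially classified counts over cells in proportion to the current estimate (E-step), and re-estimates the cell probabilities (M-step). The counts in the usage lines are hypothetical, since the cell counts of Figure 1 are not reproduced here, and the function names are illustrative only. It also assumes every row and column of Table A is nonempty.

```python
def em_2x2(table_a, table_b, table_c, tol=1e-10, max_iter=1000):
    """Ignorable ML estimate of 2x2 cell probabilities theta[x][y].

    table_a: complete cases, counts[x][y]
    table_b: counts by X when Y is missing, [n_x0, n_x1]
    table_c: counts by Y when X is missing, [n_y0, n_y1]
    Cases with both X and Y missing carry no information under MAR and are omitted.
    """
    n_a = sum(sum(row) for row in table_a)
    theta = [[table_a[x][y] / n_a for y in range(2)] for x in range(2)]  # CC start
    n_info = n_a + sum(table_b) + sum(table_c)

    for _ in range(max_iter):
        # E-step: expected complete-data counts given the current theta.
        filled = [[float(table_a[x][y]) for y in range(2)] for x in range(2)]
        for x in range(2):
            row_total = theta[x][0] + theta[x][1]
            for y in range(2):
                filled[x][y] += table_b[x] * theta[x][y] / row_total
        for y in range(2):
            col_total = theta[0][y] + theta[1][y]
            for x in range(2):
                filled[x][y] += table_c[y] * theta[x][y] / col_total

        # M-step: saturated multinomial, so the update is a set of proportions.
        new_theta = [[filled[x][y] / n_info for y in range(2)] for x in range(2)]
        change = max(abs(new_theta[x][y] - theta[x][y])
                     for x in range(2) for y in range(2))
        theta = new_theta
        if change < tol:
            break
    return theta


def expected_imputation(n_missing_both, theta):
    """Expected allocation of cases with both X and Y missing (Table D)."""
    return [[n_missing_both * theta[x][y] for y in range(2)] for x in range(2)]


# Hypothetical counts, for illustration only.
theta_iml = em_2x2([[40, 10], [20, 30]], [30, 10], [10, 10])
table_d = expected_imputation(20, theta_iml)
```

Because the ignorable model treats the both-missing cases as uninformative, their expected imputation is simply their total multiplied by the estimated cell probabilities.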

Imputation using $\hat{\theta}_{NB}$ is fundamentally different from imputation using $\hat{\theta}_{CC}$ or $\hat{\theta}_{IML}$, in 
that it assumes missingness is informative. From Table B, it surmises that missingness of Y 
is associated with X = 0. From Table C, it surmises that missingness of X is associated with 
Y = 0. It then combines this information in a smooth fashion to conclude that a larger 
proportion of the individuals who have both X and Y missing fall into the (X = 0, Y = 0) 
category. 


4. DISCUSSION 


Our work is clearly at an early stage of development. Nevertheless, we feel that it has impor- 
tant potential applications, both specifically to the estimation of undercount using a PES, and 
generally to contingency table modeling when some data are missing. We conclude with two 
brief comments: first, on the need for continuing research on these procedures; and second, 
on the need to judge the relative propriety of models when devising an imputation procedure. 


Figure 1. Observed Data (Table A: Complete Cases; Table B: Y missing; Table C: X missing; Table D: Both missing) 


4.1 Continuing Research 


Two kinds of research efforts are needed before our NB method can become broadly 
applicable. First, computationally-oriented research is needed to address the ridge-like posterior 
distribution. Alternatives to the mode, such as the posterior mean, are worth considering. Fur- 
thermore, measures of uncertainty should also be calculated, and considering the odd non- 
normal shape of the posterior, these may not be simple to summarize or compute. One strategy 
focuses directly on drawing multiple values of $\theta$ from this posterior distribution without 
explicitly finding the posterior mode or the mean; these draws of $\theta$ may be used to multiply 
impute the missing data. 


Related to the issue of measuring uncertainty is the issue of performance in repeated sampling 
experiments. Although we believe our Bayesian approach is fully appropriate, it is important 
for broad application to evaluate the operating characteristics of this procedure in the 
wide range of circumstances to which it might be routinely applied. For example, how 
well does it work in realistic cases when, unknown to the data analyst, the missing data are 
MAR? 

These topics will be the focus of a major continuing research effort. 


4.2 The Need to Judge the Relative Propriety of Models 


Considering the fully saturated model for (X, Y, R) with parameter $\theta$, any method 
of imputation, no matter how illogical, can be viewed as the correct procedure under some 
model. For example, consider imputation using $\hat{\theta}_{CC}$ as the reference table for all missingness 
patterns. This posits conditional distributions for the missing data, given the observed data 
and R, about which there is no information in the observed values. Hence, coupling these 
distributions with the estimable distributions (the distributions of R and the observed 
data) implies an estimate for $\theta$, which maximizes the likelihood under the saturated model! 
It is not a very sensible answer, since it corresponds to the unique MLE under a model in 
which all sorts of conditional distributions given various missingness patterns R are equal 
to the conditional distributions given R = (1, 1, ..., 1); however, if we consider the likeli- 
hood function only, there is no reason to prefer any other maximum-likelihood estimate to 
this one. 


Even stranger methods of imputation, such as “impute all missing values as zero,” 
correspond to particular models with estimated $\theta$’s that are MLE’s under the saturated model, but 
they violate good sense. Any sensible attempt to impute missing data values is based on the 
belief that two individuals with similar values of observed characteristics, and similar miss- 
ingness patterns, are not radically different in those characteristics that are observed for one 
and missing for the other. Our NB method formalizes this notion of smoothness by specifying 
a contingency table model with small higher-order interactions. 


Choosing one imputation procedure over another, then, cannot be done on maximum- 
likelihood-type principles alone, but must involve consideration of the propriety of the 
underlying prior specifications. This is not really a serious problem; sound statistical practice 
has always advocated the use of smooth or parsimonious models when less smooth models 
fit the data equally well. Consider fitting straight lines or polynomial curves through a collec- 
tion of data points; simpler models are preferable to complicated ones on scientific grounds 
— the same issues arise in imputation. We believe that the model, given by (2) and (3), 
underlying our NB method, will be reasonable in many problems, just as linear regression is 
a reasonable tool in many problems. 


The Sources of Census Undercount: 
Findings from the 1986 Los Angeles Test Census 


DAVID J. FEIN and KIRSTEN K. WEST¹ 


ABSTRACT 


This paper presents results from a study of the causes of census undercount for a hard-to-enumerate, 
largely Hispanic urban area. A framework for organizing the causes of undercount is offered, and various 
hypotheses about these causes are tested. The approach is distinctive for its attempt to quantify the sources 
of undercount and isolate problems of unique importance by controlling for other problems statistically. 


KEY WORDS: Census; Undercount; Coverage improvement; Post enumeration survey. 


1. INTRODUCTION 


In the last decade or two the need to better understand the causes of undercount in the U.S. 
census has become pressing. As the census has become an increasingly important tool in gover- 
ning the nation, conducting business, and monitoring social change (Citro and Cohen 1985; 
Clogg et al. 1986), public concern about the quality of census data has intensified. Much of 
this concern has arisen because it is perceived, with good foundation, that net census under- 
count disproportionately affects the economically disadvantaged members of society (Citro 
and Cohen 1985, ch. 5; Ericksen 1983). Representatives of the disadvantaged believe that as 
a result their constituents are being denied a fair share of public funds and political represen- 
tation (Choldin 1987). 

Assuming that an acceptable method could be found, one solution to the problem would 
be to correct the census for the bias due to differential undercount. In the fall of 1987, how- 
ever, the Department of Commerce decided not to adjust the 1990 census but instead to con- 
centrate on achieving a more complete enumeration (Ortner 1987). 

Improving census coverage implies a need to understand the causes of census undercount 
better than ever before. Many special coverage improvement programs were implemented in 
the 1980 census, and these may have contributed to the achievement of historically low levels 
of overall net coverage error. In spite of such efforts, wide socioeconomic coverage differen- 
tials have persisted. In response, the Census Bureau has embarked on a broad research pro- 
gram to identify the causes of undercount, concentrating on population subgroups that are 
especially difficult to enumerate. 

This paper presents results from a study of the causes of census undercount in a hard-to- 
enumerate, largely Hispanic area in Los Angeles. The approach is distinctive for its attempt 
to quantify the sources of undercount and isolate problems of unique importance by control- 
ling for other problems statistically. 

Though the putative inequities mentioned above result from net census coverage error (omis- 
sions less erroneous enumerations), to keep the analysis manageable only census omissions are 
investigated here. Omissions in the U.S. census deserve a higher position on the research agenda 
because they are more numerous, vary more systematically with socioeconomic characteristics, 
and have been more politically controversial than erroneous inclusions. 


¹ David J. Fein and Kirsten K. West, Undercount Research Staff, Statistical Research Division, U.S. Bureau of the 
Census, Washington, D.C. 20233. The views expressed in this paper are those of the authors and do not necessarily 
reflect those of the Census Bureau. 

The paper begins by describing a system for classifying the causes of undercount. Methods 
and results are presented next. A concluding discussion summarizes the implications for cov- 
erage improvement. 


2. RESEARCH MODEL 


The research model is presented in Figure 1. It represents undercount as a problem that occurs 
primarily at the household, rather than the individual, level. This specification is consistent 
with the basic sources of undercount in a census based on contacting each household rather 
than every individual in the population. 

Three different household-level undercount problems are distinguished in the top margin 
of Figure 1: the omission of an entire household due to failure to enumerate a physical housing 
unit, the omission of an entire household in an enumerated housing unit, and the omission 
of only some members in a household where others are enumerated. Each of the three under- 
count problems can originate in census operations, in the society being enumerated, or in an 
interaction between operational and social system features. The following discussion is restricted 
to errors associated with the mailout/mailback methods used in the 1986 Los Angeles test census 
for a largely low income, Hispanic population. 


2.1 Implementation of Census Operations 


Operational difficulties during the census can cause the omission of housing units, of 
households in enumerated units, and of individuals in enumerated units. Occupied housing 
units can be missed because they are never added to the address lists or because they are on 
the lists but are erroneously deleted (U.S. General Accounting Office 1980). Given that a 
housing unit is correctly listed, all of the persons living in that unit may still be missed by the 
census due to misclassification of occupied units as vacant during nonresponse followup (U.S. 
Bureau of the Census 1987b; Ericksen 1983). 

For questionnaires which households complete and mail back there are relatively few pro- 
cedures for detecting missing persons. Procedures aimed at improving within household cov- 
erage include a question asking respondents if they were uncertain about including anyone and 
a clerical consistency check between a roster of household members requested at the begin- 
ning of the questionnaire and the number of persons for whom data are provided later on in 
the form (U.S. Bureau of the Census 1987b; Edson 1987). These procedures ‘‘cause’’ within 
household omission if they do not operate as intended due to errors in the administration of 
edit followup. Similarly, errors by enumerators during mail nonresponse followup may result 
in failure to add persons who should have been added. 

Another important census operation is public information. Census publicity programs are 
designed to motivate mail response and reduce deliberate concealment by educating people 
about the uses of census data, the importance of complete reporting, and the confidentiality 
of census records. The extent to which such programs can reduce within household omission 
is unknown. 


2.2 The Social System 


At each stage of the census, data collection procedures come into contact with a social system 
which has many attributes that can impede enumeration. These attributes include unwillingness 
to report some or all household members, inability to report in a manner consistent with census 
definitions, and low “social visibility” of household members or the housing units in which 
they live. (Social visibility is the degree to which household members and housing units possess 
characteristics which make them perceptible to outsiders.) 

Figure 1. Research Model 

The most important social system factors causing housing unit omission are those 
affecting the social visibility of units. Some kinds of units are easier to find and more likely to appear 
on commercial address lists than others. Social system sources of omission for households in 
enumerated units include factors depressing the visibility of household members and refusal 
to report. 

All three broad sets of social system causes are implicated in within household omission: 
unwillingness to report, definitional problems, and the differential social visibility of household 
members. Willingness to report can be approached by considering the perceived costs and 
benefits of reporting for respondents (Dillman 1978). There has been much discussion of the 
perceived costs of census reporting. People may fear that disclosure of adult males will jeop- 
ardize welfare eligibility, that persons illegally in the country will be deported, that reporting 
more persons than allowed by a lease will prompt landlord troubles, and that police will be 
informed of the whereabouts of lawbreakers (Bailar and Martin 1987). Such fears may cause 
noncompliance when there is disbelief in the Census Bureau’s promise of confidentiality. 

The sources of definitional error are quite different from those of concealment. Definitional 
errors arise in the complexities of household living arrangements, as conditioned by 
respondents’ abilities to understand and apply census enumeration and residence rules (Hainer 
et al. 1988). 

Having mentioned some of the major sources of undercount, we will now examine the extent 
to which they occurred during the 1986 Los Angeles test census. 


3. METHODS 


3.1 Data Sources 


This study takes an intensive look at undercount in a March 1986 test census conducted in 
the northern half of Los Angeles County. The population was low income and largely Hispanic. 
Nearly two-thirds (65%) of the heads of households enumerated in the census were of Spanish 
origin and 13% were Asian. Residences in this part of Los Angeles were largely single family 
dwellings (73%) and small apartment buildings (15%). Owners lived in half (51%) of the 
occupied units, in contrast with nearly two thirds (65%) of all occupied units nationwide (U.S. 
Bureau of the Census 1987a: 106, table 18; U.S. Bureau of the Census 1987c: 712, table 1285). 

The data analyzed are from the 1986 Los Angeles test census itself; the Post Enumeration 
Survey, or PES, conducted to measure test census coverage; and a special followup to the PES, 
the Causes of Undercount Survey. The census enumerated 109,900 housing units and was 
intended primarily as a test of planned 1990 census operations. 

The Post Enumeration Survey (PES) was one of these operations. The purpose of the PES, 
conducted in July 1986, was to identify census omissions and erroneous enumerations 
(Diffendal 1988). It did this by attempting to match PES to census records. When a PES 
person’s record was found in the census it was termed ‘‘matched’’; otherwise the person was 
considered ‘‘nonmatched’’. 

Three kinds of PES households are distinguished here, depending on whether all, some, 
or none of their members were matched to the census. “Complete match” households 
contain only persons in the PES who were matched to persons in the census. “Partial nonmatch” 
households contain at least one person who could not be matched and at least one person who 
was matched to the census. “Total nonmatch” households include only persons who could 
not be matched to the census. 
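The three household types can be expressed as a simple rule on the match status of each PES person in a household. The sketch below is illustrative only; the function name and the representation of match status as a list of booleans are assumptions, not part of the PES processing system.

```python
def classify_household(matched_flags):
    """Classify a PES household from its members' match status.

    matched_flags: one boolean per PES person, True if matched to the census.
    """
    if not matched_flags:
        raise ValueError("a household must contain at least one PES person")

    n_matched = sum(bool(flag) for flag in matched_flags)
    if n_matched == len(matched_flags):
        return "complete match"      # every person matched
    if n_matched == 0:
        return "total nonmatch"      # no person matched
    return "partial nonmatch"        # at least one matched, at least one not


print(classify_household([True, True, False]))  # partial nonmatch
```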

These three household types are distinguished to allow examination of problems associated 
with housing unit omission, omission of entire households in enumerated units, and omission 
of persons from households that were partially enumerated. Completely matched households 
are included for reference purposes, to represent households correctly enumerated in the census. 

A special followup survey, the Causes of Undercount Survey, was conducted in November 
1987 to obtain additional information needed to compare these household types. The survey 
obtained information on census characteristics for nonmatched persons, as well as some new 
household and housing unit data not available on the census or PES files. 

The entire partial nonmatch stratum and nearly all households in the total nonmatch stratum 
were selected for reinterview. Eight total nonmatch households had to be omitted because 
several items needed to reinterview them were missing. Households in the complete match 
stratum were subsampled to reduce survey costs. 

The distribution of the 966 completed Causes of Undercount Survey interviews by household 
type is shown in the right-most column of Table 1. This table also gives the unweighted numbers 
for all 5814 PES households and the 1420 cases in the Causes of Undercount Survey sample. 
The overall response rate for the survey was 68%, reflecting considerable success in locating 
households in a transitory urban area despite the 16 months intervening between the survey 
and the PES. 
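The overall response rate, and the rates for each household type, follow directly from the counts in Table 1. The short sketch below recomputes them; the dictionary layout and variable names are illustrative, and the per-stratum rates are derived here rather than reported in the text.

```python
# Counts taken from Table 1: PES households, survey sample, completed interviews.
table1 = {
    "Complete Match": {"pes": 4871, "sample": 489, "completed": 382},
    "Partial Nonmatch": {"pes": 738, "sample": 738, "completed": 484},
    "Total Nonmatch": {"pes": 205, "sample": 193, "completed": 100},
}

totals = {
    key: sum(row[key] for row in table1.values())
    for key in ("pes", "sample", "completed")
}

# Unweighted response rate: completed interviews divided by sampled cases.
response_rates = {
    name: row["completed"] / row["sample"] for name, row in table1.items()
}
overall_rate = totals["completed"] / totals["sample"]

for name, rate in response_rates.items():
    print(f"{name}: {rate:.1%}")
print(f"All Types: {overall_rate:.1%}")
```

The overall figure rounds to the 68% quoted above; the stratum-level rates suggest that total nonmatch households were the hardest to reinterview.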


3.2 Analysis Plan 


There are several parts to the analysis. PES total nonmatch households are examined first. 
Two sets of comparisons are made: 1) of missed housing units with enumerated housing units 
and 2) of missed households in enumerated units with enumerated households. Missed housing 
units were expected to contain a higher percentage of clustered housing units and unusual unit 
types and locations than enumerated units. Missed households in enumerated housing units 
were expected to be smaller, contain adults who were less frequently at home, and move more 
often than enumerated households. Most of the explanatory variables for housing unit and 
household omission were obtained either from the census Address Control File or from the 
PES matched file, and thus are available for all 193 total nonmatch households in the sample. 


Table 1 


Numbers of Households in the PES and Causes of Undercount Survey Sample, 
and Numbers of Completed Interviews, by Household Type. 


                      Post             Causes of Undercount Survey 
Household Type        Enumeration 
                      Survey           Sample       Completed 
                                                    Interviews 

Complete Match        4,871              489          382 
Partial Nonmatch        738              738          484 
Total Nonmatch          205              193          100 


All Types             5,814            1,420          966 


The second part of the analysis compares partial nonmatch with complete match households 
to identify factors responsible for within-household omission. Two sets of explanatory factors 
are distinguished, those indicating inadvertent or ‘‘definitional’’ errors and those represen- 
ting reasons for deliberate concealment. Indicators for definitional errors include large size 
and complex composition of households, poorly-spoken English and educational deficits. Con- 
cealment indicators include presence of recent immigrants, welfare recipiency, crowded 
housing, and disbelief in census confidentiality. It was hypothesized that partial nonmatch 
households would score higher on the definitional and concealment indicators than would com- 
plete match households. 

The analysis begins with bivariate relationships between each of the explanatory factors and 
partial omission and then considers multivariate relationships. The source for many of these 
indicators was the Causes of Undercount Survey; hence, only data from interviewed households 
are used. 

In the final part of the analysis, characteristics of four types of individuals are compared: 
persons matched in complete match and partial nonmatch households, and those nonmatched 
in partial and total nonmatch households. Characteristics compared include age, sex, educa- 
tion, relationship to the household head, and citizenship status. 

Bivariate percentages are based on weighted data to compensate for the PES and Causes 
of Undercount Survey sampling designs, though tests for differences between these percen- 
tages used unweighted numbers. Unweighted data were used to estimate parameters of log- 
linear models. The effects of the PES sampling design on estimates for the final models were 
evaluated by adding in all two-way interactions which included the PES stratification variable. 
This adjustment did not greatly change the results; thus, the estimates presented here do not 
include the stratification variable. Because the second stage of PES sampling entailed cluster 
sampling of households in census blocks, the standard errors calculated are likely to 
underestimate the true sampling errors: they are presented only as rough guides to the 
significance of parameters. 


4. FINDINGS 


4.1 Total Nonmatch Households 


Table 2 shows the final status assigned in the census to PES total nonmatch households for 
cases sent and not sent to nonresponse followup. Of the 193 total nonmatch cases 97, or 50%, 
never appeared on the census address lists. Thus, housing unit omission appears to explain 
why the PES could not find anyone in these households in the census. 

The remaining 96 cases did appear on the census address lists. What caused these households 
to be missed? The explanation is probably that most of these units were census closeout inter- 
views, where a landlord or neighbor provided only an estimate of the total number of persons 
in the household and not detailed information for individuals. This hunch is supported by the 
finding that of the 44 cases the census classified as occupied, population counts for 37 were 
“goldplated”. This means that the final count accepted for these households was not obtained 
in the usual manner by allowing the FOSDIC (Film Optical Sensing Device for Input to Computers) 
machines to count persons. Instead, goldplating involved accepting a total count for the 
household entered on the questionnaire in the field. This is likely an indication that the 
household was a closeout case. 

Thus, the census really did not miss most of these 44 households entirely, though when it 
came time for PES matching, there were no individual census person records to be matched. 


Table 2 


Final Status Assigned in the Census to PES Total 
Nonmatch Households By Nonresponse Followup Status: 
Numbers of Units 


                                          Sent to Nonresponse Followup? 
Final Status of Unit in Census 
                                          No        Yes       Total 


Omitted from the Census Address Lists     97          0          97 
Included in the Census Address Lists       4         92          96 


    Occupied, Direct Accept (b)            1 
    Occupied, Gold-plated (c)              2 
    Vacant, Direct Accept                  1         34          35 
    Vacant, Gold-plated                    0 


All Units                                101         92         193 


Notes: (a) N’s are unweighted. 
       (b) Direct Accept: FOSDIC person count accepted. 
       (c) Gold-plated: Field counts accepted instead of FOSDIC. 


An allowance is made for these cases in the dual system estimation method. Nevertheless, it 
still is true that these households were not directly enumerated. 

To summarize, 50% of the PES total nonmatch households were in units which appeared 
to have been entirely omitted. Of the households living in units which were enumerated, 54% 
had been classified as vacant, possibly erroneously, and 46% had been found to be occupied. 
Of the total nonmatch households classified as occupied in the census, up to 84% may have 
been enumerated in closeout interviews. 
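The percentages in this summary can be recomputed from counts given in the text and in Table 2. In the sketch below, the number of units classified as vacant is derived by subtraction (96 listed units less the 44 classified as occupied) rather than read from the table, and the variable names are illustrative.

```python
# Counts stated in the text and Table 2.
total_nonmatch = 193          # total nonmatch households in the sample
omitted_from_lists = 97       # never appeared on the census address lists
on_lists = 96                 # appeared on the census address lists
occupied = 44                 # classified as occupied in the census
occupied_goldplated = 37      # occupied units whose counts were goldplated

# Derived by subtraction (not read directly from the table).
vacant = on_lists - occupied

pct_omitted = 100 * omitted_from_lists / total_nonmatch
pct_vacant = 100 * vacant / on_lists
pct_occupied = 100 * occupied / on_lists
pct_closeout = 100 * occupied_goldplated / occupied

print(f"omitted from address lists: {pct_omitted:.0f}%")
print(f"listed units classified vacant: {pct_vacant:.0f}%")
print(f"listed units classified occupied: {pct_occupied:.0f}%")
print(f"occupied units possibly closeout: {pct_closeout:.0f}%")
```

The rounded results are 50%, 54%, 46%, and 84%, matching the figures in the summary paragraph.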

Figure 2 compares some physical characteristics of units left off the census address lists (light 
bars) with units that were not left off the lists (dark bars). The top set of bars represents the 
basic types of housing units. Attached single family homes, such as duplexes, appear to have 
been a major problem in the L.A. test census. Thirty-four percent (34%) of the missed units 
fell into this category, in contrast to only 8% of enumerated units. Missed units were less likely 
than enumerated units to be detached single family homes or apartments in large buildings, 
suggesting that the census was more successful at finding such units. 

Whether or not an interview was completed, Causes of Undercount Survey interviewers were 
asked to record when units they visited fit any of several ‘‘unusual unit’’ categories listed on 
the front of their questionnaires. The bottom half of Figure 2 shows that the interviewers iden- 
tified a higher percentage of unusual units among units that were missing from the census 
address lists, 28%, than among units that were included, 7%. Unit types found to be particular 
problems were abandoned-looking buildings and secondary units on a lot. 

Physical characteristics of units thus do appear to affect their visibility during census address 
list development. What might cause households to be missed in units that were enumerated? 

Households may be more easily missed if they are small and mobile. Figure 3 compares 
characteristics of total nonmatch households in enumerated units with a combined group of 
complete match and partial nonmatch households - that is, households which were 
enumerated. Households missed in the test census (light bars) were on average considerably 
smaller than those where some or all members were counted (dark bars). Whereas 53% of the 
total nonmatch households in enumerated units had one or two members, only 35% of the 
enumerated households were this small. 

Figure 2. Physical Characteristics of Enumerated and Missed Housing Units 
(Weighted Percentages) 

Figure 3. Characteristics of Enumerated Households and Total Nonmatch Households in 
Enumerated Units (Weighted Percentages) 

Indicators of the propensity to move include home ownership and actual household mobility 
in the four months between the census and the PES. Households missed in the census were 
more likely to be renters and movers (61% and 8%, respectively) than were enumerated 
households (46% and 0%, respectively). The percentage of households in which all adults were 
employed full-time in March 1986 was greater by 12% for omitted households than for 
enumerated households, though the number of interviews for omitted households was too small 
for this difference to be statistically significant. 

These results support the hypothesis that missed housing units and households missed in 
enumerated units possess attributes which reduce their visibility during a census. 


4.2 Partial Nonmatch Households 


From total nonmatch households, the focus shifts to the factors associated with partial 
household omission. In this phase of the analysis, 484 partial nonmatch households were com- 
pared with 331 complete match households. Single person households were excluded from the 
382 complete match households in the Causes of Undercount Survey sample, since they were 
not at risk of partial omission. 

Two different sets of explanatory factors were considered. The first represents household 
characteristics thought to be associated with definitional errors, described earlier as errors 
resulting from inconsistencies between household membership as understood by the Census 
Bureau and by census respondents. The second set of indicators represents factors thought 
to be associated with the deliberate concealment of household members. 

Figure 4. Definitional Error Indicators for Partial Household Omission: Households with 
2+ Persons (Weighted Percentages) 

Definitional Errors 


Indicators for definitional errors include household size and composition, English language 
ability, census respondent’s education, and edit followup status. Larger households, those con- 
taining more distant relatives and persons unrelated to the household head, those speaking 
a language other than English at home, those where the census respondent’s education was 
low, and households not sent to edit followup were all expected to be at greater risk of defini- 
tional errors. 

Figure 4 supports these hypotheses. It shows that partial nonmatch households (light bars) 
were considerably larger than complete match households (dark bars): 45% of the partial non- 
match households but only 19% of the complete match households contained six or more 
members. Whereas 40% of the partial nonmatch households contained only nuclear relatives 
of the household head, fully 72% of the complete match households were nuclear. Partial non- 
match households were less likely to have been sent to edit follow-up by a slight, but statistically 
significant, amount. Partial nonmatch households were more likely to speak a language other 
than English at home (83%) than were complete match households (64%). Finally, census 
respondents from partial nonmatch households had less formal education than those from com- 
plete match households: 36% of the census respondents from partial nonmatch households 
had not attended high school, in contrast with 24% of the respondents from complete match 
households. 

Log-linear models were fitted to see whether these differences persisted at the multivariate 
level. The dependent variable in these models was partial household omission, with complete 
match households coded as 0 and partial nonmatch households coded as 1. Interactions between 
partial omission and each of the independent variables in Figure 4 were tested in a series of 
nested models. All two-way interactions among independent variables were included in each 
model as controls. 

In the multivariate analysis, significant interactions with partial omission were found for 
all definitional error indicators except census respondent’s education. Table 3 presents the chi 
square (Wald) statistics associated with the final definitional model, which excludes census 
respondent’s education. Significant interactions of household size with composition and 
language other than English were also detected. Parameter estimates in Table 4 show the effects 
to be in the directions expected. Estimates for standardized parameters, obtained by dividing 


Table 3 


Chi Square Statistics For Testing Two-Way Interactions 
in the Final Definitional Error Model (a) 


                                      Interactions with... 

Variables            Size            Composition      Edit Followup    Language at Home 

Partial Omission     [illegible]*    [illegible]**    6.33             [illegible] 
Size                 -               [illegible]      [illegible]      [illegible] 
Composition          -               -                1.6              1.3 
Edit Followup        -               -                -                1.0 

[significance legend partly illegible] * p < .05 

a Log Likelihood X² = 42.2, df = 45, p = .5922. 

Table 4 


Parameter Estimates for Interactions Between Definitional Error Indicators and 
Partial Household Omission in the Final Model 


Marginals with Partial           Parameter    Standard    Standardized 
Nonmatch Household and...        Estimate     Error       Parameter Estimate 

Household Size: 
    2-3 Persons                    -.34         .06          -5.7 
    4-5 Persons                    -.02         .05           -.4 
Composition: 
    All nuclear                    -.36         .06          -6.0 
    All non-nuclear                 .22         .09           2.4 
Edit Followup Status: 
    Not sent                        .25         .10           2.5 
Other Language at Home? 
    Yes                             .10         .05           2.0 


parameter estimates by their standard errors, indicate that the effects of size and composition 
are about the same in magnitude and that both are larger than the effects of edit followup and 
language spoken at home. 
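
As a rough illustration of the standardization described above, the sketch below divides a few parameter estimates by their standard errors. The function name `standardized` is hypothetical, and only rows whose printed values are clearly legible in Table 4 are used.

```python
def standardized(estimate, std_error):
    """Return a parameter estimate divided by its standard error."""
    return estimate / std_error

# (estimate, standard error) pairs as printed in Table 4
rows = {
    "size: 2-3 persons": (-0.34, 0.06),
    "composition: all nuclear": (-0.36, 0.06),
    "other language at home: yes": (0.10, 0.05),
}

for label, (est, se) in rows.items():
    print(f"{label}: {standardized(est, se):.1f}")
```

The ratios for size and composition come out near 6 in absolute value, against roughly 2 for language, which is the comparison the text draws.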


Concealment Indicators 

Factors hypothesized to cause concealment of household members by census respondents 
include: fear that persons illegally in the country would be deported, fear that disclosure of 
adult males would jeopardize welfare aid, and concern that reporting more persons than allowed 
by a lease would bring landlord troubles. Indicators for these factors were, respectively, whether 
the household contained recent immigrants, defined as persons entering the country in or after 
1980; whether anyone in the household was receiving welfare during the census month; and 
the average number of persons per room in the household. Nonresponse to the census mailout 
was also included as a general indicator of failure to perceive positive benefits from respon- 
ding to the census. Finally, belief in census confidentiality was included to see whether it helped 
to reduce fears resulting in concealment. 

Figure 5 shows that all of these indicators were related to partial omission at the bivariate 
level. For example, recent immigrants were present in 26% of the partial nonmatch households 
(light bars), but only 12% of the complete match households (dark bars). Whereas 24% of 
the partial nonmatch households reported receiving welfare, only 15% of the complete match 
households did so. Partial nonmatch households were considerably more likely to exhibit 
crowding: 63% contained more than one person per room, in contrast to only 34% of the com- 
plete match households. Partial nonmatch households were also somewhat less likely than com- 
plete match households to have returned their census questionnaires by mail or to believe in 
census confidentiality. 

Again, loglinear models were fitted, with partial omission as the dependent variable and 
the concealment indicators as independent variables. All two-way interactions with household 
size were included as controls, since other things being equal, larger households would be more 
likely to exhibit crowding and contain recent immigrants than small ones. 

This time, two variables did not survive preliminary testing: mail nonresponse and belief 
in census confidentiality. Before completely dropping the confidentiality variable, tests were 
performed to see if interactions of partial omission with presence of immigrants, welfare reci- 
piency, and crowding depended on belief or disbelief in confidentiality. Belief in confiden- 
tiality was not found to affect these relationships. 

Figure 5. Concealment Indicators for Partial Household Omission: Households with 2+ 
Persons (Weighted Percentages) 


Table 5 


Chi Square Statistics For Testing Two-Way Interactions 
in the Final Concealment Model (a) 

a Log Likelihood X² = 103.8, df = 150, p = .9985. 

Table 5 shows that three of the remaining concealment variables (immigrants, welfare, and 
crowding) interacted significantly with partial household omission in a model which included 
all two-way interactions with size and all two-way interactions among independent variables. 
Standardized parameter estimates (see Table 6) suggest effects of roughly equal magnitude for 
the three indicators. 

It is noteworthy that the relationship between partial omission and size vanished when 
crowding was included (see Table 5), suggesting that the effects of size were due to its associa- 
tion with crowding rather than scale alone. Crowding was also strongly associated with the 
presence of recent immigrants. 


4.3 Person Characteristics 


For the final part of the analysis of individual-level characteristics associated with under- 
count, four kinds of persons were compared: persons the census counted in complete match 
and partial nonmatch households, and persons the census missed in partial and total nonmatch 
households. 

Figure 6 shows differences between the percentages in 10 year age groups for persons in com- 
plete match households and each of the three other groups. It shows an excess in the 20-29 
year old group for persons missed in partial and total nonmatch households relative to persons 
in complete match households. There is also evidence of an excess in the 20-29 year age group 
for persons who were enumerated in partial nonmatch households. 

Figure 6. Excess Weighted Percentage in Age Group Relative to Persons in Complete 
Match Households 

Table 6 


Parameter Estimates for Interactions Between Concealment Indicators and 
Partial Household Omission in the Final Concealment Model 


Marginals with Partial           Parameter    Standard    Standardized 
Nonmatch Household and...        Estimate     Error       Parameter Estimate 

Recent Immigrants: 
    Immigrants Present              .19         .06           3.2 
Welfare Recipiency: 
    Receiving Aid                   .17         .05           3.4 
Crowding: 
    < .5 Persons/Room              -.49         .13          -3.8 
    .5-1.0 Persons/Room            -.01         .08           -.1 

Table 7 


Percentage Distributions for Characteristics of Individuals by 
PES Match Status and Household Type 


                                           PES Match Status 

                                 Matched in                  Nonmatched in 
Characteristic             Complete      Partial        Partial       Total 
                           Match         Nonmatch       Nonmatch      Nonmatch 
                           HHs           HHs            HHs           HHs 

Sex 
  Male                     46.2%         50.6%          54.2%         48.2% 
  Female                   53.8          49.4           45.9          51.8 
  Unweighted n             1667          2564           1324          582 
Education 
  No Formal Education      10.2          10.9           17.0          14.3 
  Less than High School    30.7          34.4           [illegible]   [illegible] 
  Some High School         20.5          20.6           19.5          19.5 
  High School Graduate     38.6          34.1           36.4          28.8 
  Unweighted n             [illegible]   1560           599           [illegible] 
Relationship to Head 
  Nuclear Relative         86.1          83.2           63.6          85.9 
  Non-nuclear Relative     [illegible]   12.6           [illegible]   [illegible] 
  Non-relative             2.6           4.2            11.0          7.0 
  Unweighted n             1659          2560           1359          590 
Citizenship 
  Citizen Since Birth      66.2          55.5           52.6          50.4 
  Naturalized Citizen      [illegible]   9.5            6.4           6.4 
  Noncitizen               24.6          37.0           41.0          43.2 
  Unweighted n             1223          1567           612           316 

Persons missed by the census in partial nonmatch households were slightly more likely than 
persons in complete match households to be males and have no formal schooling, and less likely 
to be citizens or close relatives of the household head (Table 7). Persons missed by the census 
in total nonmatch households were also slightly more likely to be noncitizens and lower in educa- 
tion than persons in complete match households, but displayed no differences in sex and rela- 
tionship to household head. Thus, on the whole, persons missed in partial nonmatch households 
differed from those in complete match households in more ways than did persons missed in 
total nonmatch households. 

In addition to biasing more census characteristics, partial household omission caused the 
omission of many more persons than did total household omission. Two thirds (67%) of all 
PES nonmatch cases were in partial nonmatch households and only one third were in total 
nonmatch households. Fully 82% of all PES omissions were found in housing units the census 
enumerated and only 18% were in missed units. 


5. DISCUSSION 


The findings reported here support evidence from more qualitative studies that partial 
household omission is the most serious undercount problem in hard-to-enumerate urban areas 
of the United States today. As compared with total household omission, partial omission in 
the Los Angeles test census accounted for twice as many missing persons, reflected more intrac- 
table sources of error, and biased more individual-level census characteristics. 

The chief problems identified for total household omission were failure to include certain 
types of housing units in the census address lists and misclassifying occupied units as vacant. 
Housing units especially at risk of misclassification as vacant were those with households which 
were small and mobile and those in which all adults were working full-time. Experience with 
coverage improvement programs at the Census Bureau suggests that further reductions in 
housing unit omission may be possible. Such programs were responsible for adding about 10% 
of the units enumerated in Los Angeles. The Bureau adopted special precanvassing procedures 
in the test census to find units in large multi-unit structures. Considerable success in reducing 
this source of error in the test census is evident in Figure 2: none of the apartment units missed 
were in large buildings. 

The misclassification of occupied units as vacant will be more difficult to remedy. Allowing 
nonresponse enumerators more time per unit and improved training for certain kinds of 
problem households may help somewhat. Coupling these efforts with special callback pro- 
cedures for smaller and more transient households and those whose members are rarely at home 
would also help. 

It is clear that improvements at the margin of what is already a largely successful census 
operation will be expensive. Keyfitz (1979) and others have observed that the incremental costs 
from adding persons to the count soar as coverage approaches 100%. Programmatic innova- 
tions to reduce the errors observed in the 1986 test census would add to the $2.6 billion cost 
projected for the 1990 census, since the methodology to be used in urban areas will be very 
similar to the L.A. test census. 

Within-household errors will be even more difficult to address than total household omis- 
sions. The Bureau must redouble its efforts to understand the complex living arrangements 
and cognitive and/or cultural factors that condition how people perceive household member- 
ship. The findings reported here suggest that further efforts targeted to respondents for whom 
English is not a native tongue, and households containing persons only distantly related to each 
other may help to reduce definitional errors. 

However, in light of the considerable research already performed to improve the design of 
the census questionnaire and the complex enumeration and residence rules to which the Bureau 
is bound by statute and tradition, further reductions in definitional error will require extra- 
ordinary efforts. Definitional errors are deeply embedded in cultural differences and educa- 
tional deficits among hard-to-enumerate groups. 

Within-household omission also was found to be strongly related to the presence of 
immigrants, welfare recipiency, and crowding. That a PES-based study could detect such effects 
suggests that the PES succeeded in counting many persons whose presence had been concealed 
in the census. Some of the effects of the so-called concealment variables may be due to 
uncontrolled factors other than concealment, but the persistence of relationships even after 
household composition was added in a final log-linear model (not shown) suggests that the 
PES really did detect some persons who were concealed in the census. Thus, there appears to 
be a continuum from households that are highly resistant to enumeration to those which are 
less resistant, and for the latter more intensive methods like those used in the PES may be 
effective. 

The social conditions underlying the most resistant forms of concealment present the 
most difficult problems for the Census Bureau. Public information programs attempting 
to convince people that the census is important and that census data will be kept confidential 
were not very effective for the hard-to-enumerate population in the Los Angeles test census, 
as reported by Moore and McDonald (1987), though these programs may work better 
under real decennial census conditions. The minimal role found for belief in census confiden- 
tiality, either in its own right or in mediating between household circumstances and conceal- 
ment, suggests that the relationship between attitudes and census response behavior is not a 
simple one. 

The findings reported here should not be generalized uncritically to the sources of under- 
count expected to affect urban areas in the 1990 Census. Because the data are based on a test 
census, errors may reflect inexperience with experimental procedures or failure to convince 
respondents (and census workers) that the project was as serious as the decennial census. Fur- 
ther, to the degree that Los Angeles is unlike other major urban areas, it may experience unique 
census-taking problems. For example, Los Angeles is thought to be home to more illegal aliens 
than any other major city (Heer and Passel 1987). 

On the other hand, the net undercount rate for Los Angeles in 1980 was quite similar to 
the rates for other major cities, as measured in the 1980 Post Enumeration Program (Fay et 
al. 1988). Thus, what they lack in illegal aliens, these cities may make up in other hard-to- 
enumerate groups. Further research is needed to assess the degree to which causes of under- 
count differ by race, ethnicity, and other social characteristics. 

It is encouraging that the causes of undercount identified in this Post Enumeration Survey- 
based study were reasonably consistent with more qualitative reports by ethnographers and 
focus groups. Also, the PES estimates for undercount from the Los Angeles test census are 
believed to be of high quality (Hogan and Wolter 1988). For these reasons, extension of the 
PES-based methodology developed in this paper to other urban (and nonurban) areas is recom- 
mended. 

On the social system side, further research on how rationally people weigh the costs and 
benefits of responding to censuses and surveys would help to weigh the potential for improving 
census coverage through the Census Bureau’s public information and community action pro- 
grams. Better indicators for household-level reasons for concealment are also needed. 
Examining specific assistance programs would help to confirm the effects of welfare participa- 
tion on census coverage, since not all aid would be imperiled by revealing true household- 
composition. 

Improved measurement of the sources of undercount arising in census operations is also 
needed. If data from census quality control programs were combined with PES matching 
results, error sources could be identified with greater precision. 


Total Error in the Dual System Estimator: The 1986 
Census of Central Los Angeles County 


ABSTRACT 


The U.S. Bureau of the Census uses dual system estimates (DSEs) for measuring census coverage error. 
The dual system estimate uses data from the original enumeration and a Post Enumeration Survey. In 
measuring the accuracy of the DSE, it is important to know that the DSE is subject to several components 
of nonsampling error, as well as sampling error. This paper gives models of the total error and the com- 
ponents of error in the dual system estimates. The models relate observed indicators of data quality, such 
as a matching error rate, to the first two moments of the components of error. The propagation of error 
in the DSE is studied and its bias and variance are assessed. The methodology is applied to the 1986 Census 
of Central Los Angeles County in the Census Bureau’s Test of Adjustment Related Operations. The meth- 
odology also will be useful to assess error in the DSE for the 1990 census as well as other applications. 


KEY WORDS: Nonsampling error; Post enumeration survey; Coverage evaluation; Undercount; 
Capture-Recapture. 


1. INTRODUCTION 


The dual system estimator (DSE) is used in several contexts for estimating the size of a 
population. Its applications range from wildlife populations to human populations. DSEs of 
births are used at the U.S. Bureau of the Census in the formation of the demographic analysis 
estimates of the national population. Currently, the Census Bureau intends to use DSEs for 
measuring coverage error in the 1990 Decennial Census. This paper focuses on the applica- 
tion of the DSE in the census context where the two systems are the original enumeration and 
a Post Enumeration Survey (PES). 

The obvious estimator of census undercoverage based on the DSE is $UC$, given by 
$UC = DSE - CEN$, with CEN referring to the size of the original census enumeration. Since 
$DSE = CEN + UC$, the DSEs also provide alternative estimates of population. A more 
general class of alternative estimates based on the DSE (Spencer 1980; 1986) is 
$(1 - f) \times CEN + f \times DSE$, or equivalently 

$$CEN + f \times UC$$

with $0 \le f \le 1$. 
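
The equivalence of the two forms can be checked numerically. The following is a minimal sketch, not the authors' code: the function name `composite_estimate` is hypothetical, and the inputs are the CEN and DSE totals reported for the TARO site in Section 2.

```python
def composite_estimate(cen, dse, f):
    """Return CEN + f * UC, where UC = DSE - CEN and 0 <= f <= 1."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie between 0 and 1")
    uc = dse - cen  # estimated undercoverage
    return cen + f * uc

cen, dse = 355_352, 388_040  # TARO totals reported in Section 2

for f in (0.0, 0.5, 1.0):
    direct = (1 - f) * cen + f * dse  # the first form given in the text
    print(f, composite_estimate(cen, dse, f), direct)
```

Setting f = 0 returns the census count and f = 1 returns the DSE, so both are members of the class.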


Estimates of total error of the DSE are essential for determining what value of f leads to 
the most accurate estimator of population size. Since the range of values for f include 0 and 
1, the selection of either CEN or DSE is possible. The criteria for improvement of one set of 
population estimates over another may be based on measures of the quality of the distribu- 
tion of the population (Hogan and Mulry 1987; Spencer 1986). Estimates of total error in the 
DSE are also important for statistical planning purposes, e.g., how much money should be 
spent and how big a sample should be fielded in the PES. 

DSEs are subject to several components of nonsampling error, in addition to sampling error. 
We present models of the total error and the components of error in the DSE. The models relate 
observed indicators of data quality to the first two moments of the components of the error. 
We then use techniques of propagation of error to estimate the bias and variance of the DSE. 
In doing so, we assess the total error, or the joint effect of the errors. Previous work on error 
models for the DSE includes Seltzer and Adlakha (1974). 

The methodology is applied to the 1986 Census of Central Los Angeles County, also known 
as the 1986 Test of Adjustment Related Operations (TARO) conducted in Los Angeles (Dif- 
fendal 1988). The PES in TARO comprised about 6,000 housing units and over 19,000 people. 
A sensitivity analysis shows how the component errors interact, which ones cancel, and which 
ones compound each other. The methods described here to estimate the error in the TARO 
DSE can be extended to estimate the error in the 1990 DSEs. 

We have organized this paper so that sections can be read selectively. Section 
2 introduces and presents the rationale for the TARO DSE and its major components. Our 
strategy for assessing the component errors and combining them to estimate the total error 
in the DSE is described next (Section 3). A detailed description of the DSE, with notation, is 
necessary for precise description of the component errors (Section 4). Following that descrip- 
tion is an assessment of the component errors (Section 5). A synthesis of the component errors 
leads to estimates of the total error of the DSE (Section 6). Our major conclusions are then 
presented (Section 7). 


2. DUAL SYSTEM ESTIMATOR 


The application of the dual system estimator requires assuming that there are two lists of 
the population. The first list is the original census enumeration, and the second is an implicit 
list of those covered by the sampling frame for the P sample of the PES, whom we will call 
the P-sample population. The sampling frame itself is not a list of people, but of census blocks. 

The P sample is one of the two samples that comprise the PES. The PES is composed of 
the E sample, which is a sample of census enumerations, and the P sample, which is a sample 
of the population. The E sample is selected to estimate the number of enumerations that are 
erroneous. The P sample is selected to estimate, through dual system estimation, the number 
of people missed by the original enumeration. 


Table 1 
Probabilities of Inclusion in a Cell 


                         Original Enumeration 

                     In           Out          Total 

P sample    In       p_{i11}      p_{i12}      p_{i1+} 
            Out      p_{i21}      p_{i22}      p_{i2+} 

            Total    p_{i+1}      p_{i+2}      p_{i++} 

Survey Methodology, December 1988 243 


Table 2 


True Population Size in Each Cell 


                         Original Enumeration 

                     In           Out          Total 

P sample    In       N_{11}       N_{12}       N_{1+} 
            Out      N_{21}       (N_{22})     (N_{2+}) 


The dual system estimator is based on a model in which the probabilities that the i-th individual 
in the population of size N is in the census or not and in the P sample or not are as shown in 
Table 1; see Wolter (1986a) for discussion and references to earlier work. The 
true population size in each category is defined in Table 2. 

In Table 2, $N_{++} = N$, the total population size. Even if we could observe the $N_{ij}$'s in the 
first row and first column, the $N_{ij}$'s in parentheses would not be observed directly, but would 
have to be estimated from the model. The DSE of N then would have the form $N_{1+} N_{+1} / N_{11}$, 
which we will refer to as the ideal DSE. 
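
A small numerical sketch may help fix the form of the ideal DSE. The cell counts below are hypothetical and chosen only for illustration, and the function name `ideal_dse` is not from the paper.

```python
def ideal_dse(n11, n12, n21):
    """Ideal dual system estimate from the three observable cells.

    n11: in both the P sample list and the original enumeration
    n12: in the P sample list only
    n21: in the original enumeration only
    """
    n1_plus = n11 + n12   # P sample row total
    n_plus1 = n11 + n21   # original enumeration column total
    return n1_plus * n_plus1 / n11

# Hypothetical counts, for illustration only
n11, n12, n21 = 800, 100, 150
total = ideal_dse(n11, n12, n21)
implied_n22 = total - (n11 + n12 + n21)  # the cell missed by both lists
print(total, implied_n22)
```

The implied count for the unobserved cell equals n12 * n21 / n11, which is what the model assigns to people missed by both lists.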

In estimating population size for measuring census coverage error, the N's are replaced by 
estimates from the original enumeration and two sample surveys, the P sample and the E 
sample. The survey data are weighted by the reciprocals of the selection probabilities. In the 
following definitions, the estimates with a hat reflect the possible presence of nonsampling error: 


$N_p$ = the weighted number of P-sample selections, 

$\hat{N}_p$ = the estimate of the total population from the P sample, 

CEN = the size of the original enumeration, 

$II_1$ = the number of persons imputed, 

$II_2$ = the weighted number of census enumerations with insufficient information for 
matching, 

$EE$ = the weighted number of erroneous enumerations in the original enumeration, based 
on the E sample, 

$\widehat{EE}$ = the estimate of the number of erroneous enumerations in the original enumeration, 

$C = CEN - II_1 - II_2 - EE$ = the weighted number of distinct people in the original 
enumeration from the E sample, 

$\hat{C} = CEN - II_1 - II_2 - \widehat{EE}$ = the estimate of the number of distinct people in the 
original enumeration from the E sample, 

$M$ = the weighted number of people in the census and the P sample, 

$\hat{M}$ = the estimate of the number of people in the census and the P sample. 

With this notation, $\hat{N}_p$ estimates $N_p$, which unbiasedly estimates $N_{1+}$. The ratio $\hat{C}/\hat{M}$ is 
used to estimate the ratio $N_{+1}/N_{11}$. (By themselves, $\hat{C}$ and $\hat{M}$ are not good estimators of $N_{+1}$ 
and $N_{11}$.) Thus, the estimator has the form $\hat{N}_{++} = \hat{N}_p \hat{C}/\hat{M}$. The ratio $\hat{C}/\hat{M}$ contains a 
correction for erroneous enumerations and for cases with insufficient information for matching, 
$II_1$ and $II_2$, so that cases with no chance of being included in the denominator are also excluded 
from the numerator. 

The DSE is used to estimate the percent net undercount, or the net undercount rate, in the 
original enumeration, 

$$U = 100 \, (CEN - \hat{N}_{++}) / \hat{N}_{++}.$$

For the TARO site (i.e. Central Los Angeles County) as a whole, CEN = 355,352, 
$\hat{N}_p$ = 336,707, $\hat{C}$ = 343,567, $\hat{M}$ = 298,204, and $\hat{N}_{++}$ = 388,040. Using these numbers, the 
estimate of the net undercount rate is 8.42. 
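
The reported rate can be reproduced from the published totals. This is a sketch under two stated assumptions: the sign is taken so that an undercount is positive, matching the reported 8.42 (the printed formula may use the opposite sign), and the function names are hypothetical. Applying the estimator directly to the site-level components gives a value slightly below the published 388,040; the source does not explain the difference.

```python
def dse(n_p, c, m):
    """Dual system estimate of the form N_p * C / M."""
    return n_p * c / m

def net_undercount_rate(cen, n_total):
    """Percent net undercount, signed so that an undercount is positive (assumed)."""
    return 100.0 * (n_total - cen) / n_total

cen = 355_352
n_p, c, m = 336_707, 343_567, 298_204
published_dse = 388_040

print(round(dse(n_p, c, m)))                               # site-level recomputation
print(round(net_undercount_rate(cen, published_dse), 2))   # uses the published DSE
```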


3. STRATEGY FOR ASSESSING TOTAL ERROR 


The DSE is subject to various sources of error, including error due to incorrect addresses 
from the P sample, error due to missing data (unit and item nonresponse), response errors, 
interviewer errors, correlation bias, sampling error, etc. We wish to estimate the effects of these 
diverse sources of error on the DSE. 

The first step in our strategy is to express the DSE as a function of components. We have 
constructed the components so that, for the most part, the different sources of error act either 
independently or perfectly dependently on different components. By isolating the effects of 
the various errors, we are better able to identify the major distinct sources of error. 

Next, we estimate the first two moments of the component errors, one component at a time. 
In doing so we draw upon the results of various TARO evaluations and quality control pro- 
grams. The way we constructed the components implies that correlation between component 
errors typically equals either 0 or 1. 

To study the propagation of errors we have used computer simulation methods. A 
multivariate distribution of the error components, say F, was assumed. The specification of 
F was consistent with the first two moments as estimated in Section 5. Realizations of the com- 
ponent errors were simulated by pseudo-random draws from F and then the DSE was calculated; 
this procedure was repeated 10,000 times and the resulting empirical distribution of the DSE 
was used as an estimate of its actual distribution. The first two moments of the latter distribu- 
tion provide numerical estimates of the total error of the DSE. 

Sensitivity analysis was performed to discover the importance of using one distributional 
form for F rather than another. The results suggest that the exact distributional form (beyond
the first two moments) is relatively unimportant (see Section 6). 
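
The simulation and the sensitivity check can be sketched in a few lines. What follows is a minimal illustration, not the program used for TARO. It assumes the component errors are independent, uses only the four components whose moments are summarized in Section 5 (matching, Census Day address, fabrication, and measurement of erroneous enumerations), leaves out correlation bias, missing data, and sampling error, and works at the site level without post-stratification. The function names, and the choice of a normal and a uniform form for F, are assumptions made for illustration.

```python
import math
import random
import statistics

# Site-level observed values quoted in Section 2.
C, N_P, M, CEN = 343_567, 336_707, 298_204, 355_352

# (mean, variance) of each component error as estimated in Section 5.
COMPONENTS = {
    "m_m": (-1831, 30_923),    # matching error, variance (17^2 x 107)
    "m_a": (-3481, 260_100),   # Census Day address error
    "m_f": (-2502, 59_534),    # fabrication in the P sample
    "c_e": (-238, 4_046),      # measurement of erroneous enumerations
}


def draw_normal(rng, mean, var):
    """One draw from a normal distribution with the given mean and variance."""
    return rng.gauss(mean, math.sqrt(var))


def draw_uniform(rng, mean, var):
    """One draw from a uniform distribution with the same first two moments."""
    half_width = math.sqrt(3.0 * var)  # uniform on [a, b] has variance (b - a)^2 / 12
    return rng.uniform(mean - half_width, mean + half_width)


def simulate_undercount(draw, n_rep=10_000, seed=1):
    """Return the mean and standard deviation of the simulated undercount rate."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_rep):
        err = {name: draw(rng, mu, var) for name, (mu, var) in COMPONENTS.items()}
        m = err["m_m"] + err["m_a"] + err["m_f"]      # error in the match count
        n_hat = (C - err["c_e"]) * N_P / (M - m)      # corrected DSE, with n_p = 0
        rates.append(100.0 * (n_hat - CEN) / n_hat)
    return statistics.mean(rates), statistics.stdev(rates)


mean_normal, sd_normal = simulate_undercount(draw_normal)
mean_uniform, sd_uniform = simulate_undercount(draw_uniform)
print(mean_normal, sd_normal)
print(mean_uniform, sd_uniform)
```

Comparing the two calls gives a direct, if crude, check of how much the distributional form beyond the first two moments matters.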

We adopted a Bayesian approach in investigating the error in the DSE. We estimated
the first two moments of the distributions for the error components, then we derived the
posterior distribution of the undercount rate conditional on the observed values of $C$, $N_p$, $M$,
etc.


4. COMPONENTS OF THE DSE 


The DSE is subject to sampling errors and nonsampling errors, including failure of assump- 
tions underlying the DSE model. The DSE does have a bias, but the bias in the census context 
is negligible (Wolter 1986a). Nonsampling errors may affect the accuracy of estimation of 
$N_{1+}$, $N_{+1}$, and $N_{11}$. Descriptions of the nonsampling error follow. Throughout, a tilde
marks the value an estimate would take in the absence of nonsampling error.

The error in the estimation of $N_{1+}$ is defined by $C - N_{1+} = (C - \tilde{C}) + (\tilde{C} - N_{1+})$.
The first term $(C - \tilde{C})$ is the net nonsampling error, which contributes to both bias and
variance, and the second term $(\tilde{C} - N_{1+})$ is the sampling error, which contributes only to the
variance. Define the net nonsampling error as $c = C - \tilde{C}$.

The net error $c$ arises during the processing of the E sample when respondents are
misclassified as to whether they are correctly or erroneously enumerated in the original
enumeration. Therefore, $c$ has three components: $c_e$, which occurs during the data collection and
processing; $c_b$, caused by a PES design that fails to balance estimates of the gross overcount and
gross undercount; and $c_i$, caused by missing data: $c = c_e + c_b + c_i$. Sections 5.5, 5.6 and 5.7
cover $c_e$, $c_b$, and $c_i$, respectively.

The error in the estimation of $N_{+1}$ is defined by $N_p - N_{+1} = (N_p - \tilde{N}_p) + (\tilde{N}_p - N_{+1})$.
The first term $(N_p - \tilde{N}_p)$ is the nonsampling error, which contributes to both bias
and variance, and the second term $(\tilde{N}_p - N_{+1})$ is the sampling error, which contributes only
to the variance. The net nonsampling error is defined by $n_p = N_p - \tilde{N}_p$.

The net error $n_p$ arises during the interviewing for the P sample when the P-sample
selections are not interviewed. This situation occurs when household members are fabricated
or when there is missing data. Therefore, $n_p$ has two components: $n_{pf}$, the error due to
fabrication, and $n_{pi}$, the error due to missing data: $n_p = n_{pf} + n_{pi}$. Section 5.4 discusses $n_{pf}$
and Section 5.7 covers $n_{pi}$.

The error in the estimation of $N_{11}$ is defined by $M - N_{11} = (M - \tilde{M}) + (\tilde{M} - N_{11})$.
The first term $(M - \tilde{M})$ is the net nonsampling error, which contributes to both bias and
variance, and the second term $(\tilde{M} - N_{11})$ is the sampling error, which contributes only to
the variance.

To facilitate the description of the nonsampling error in the estimation of $N_{11}$, consider the
following tables of P-sample selections and respondents. Entries in Table 3 are the weighted 
number of P-sample selections in each category. Entries in Table 4 are the weighted number 
of P-sample responses in each category. Entries in Table 5 are estimates of the number of people 
in each category based on the P-sample interviewing, responses, and matching operation. 


Table 3
P-sample Selections

                                      Census Enumeration Status
P-sample Selections                   Enumerated     Not Enumerated

Not reported                          $D_{11}$       $D_{12}$
Reported
  Correct Census Day Address          $D_{21}$       $D_{22}$
  Wrong Census Day Address            $D_{31}$       $D_{32}$

Table 4
Enumeration Status of P-sample Respondents

                                      Census Enumeration Status
P-sample Status                       Enumerated     Not Enumerated

Fabricated                            $A_{11}$       $A_{12}$
Not Fabricated
  Correct Census Day Address          $A_{21}$       $A_{22}$
  Wrong Census Day Address            $A_{31}$       $A_{32}$
Table 5
Match Status of P-sample Respondents

                                      Match Status
P-sample Status                       Matched        Not Matched

Fabricated                            $B_{11}$       $B_{12}$
Not Fabricated
  Correct Census Day Address          $B_{21}$       $B_{22}$
  Wrong Census Day Address            $B_{31}$       $B_{32}$


Since the P-sample selections who appear as reported in Table 3 are the respondents who
are not fabricated in Table 4, $D_{21} = A_{21}$ and $D_{31} = A_{31}$. Also, $A_{11} = 0$ since a case
fabricated during the PES cannot be enumerated in the census. Therefore, the matched total
free of nonsampling error, written $\tilde{M}$, is


$$\tilde{M} = D_{11} + D_{21} + D_{31} = D_{11} + A_{21} + A_{31}.$$


Since a case fabricated during the PES would not have a corresponding census enumeration,
we assume $B_{11} = 0$. Therefore, $M = B_{11} + B_{21} + B_{31} = B_{21} + B_{31}$.
Then the nonsampling error in the estimation of $N_{11}$, called $m$, may be defined as follows:


$$m = M - \tilde{M} = (B_{11} + B_{21} + B_{31}) - (D_{11} + D_{21} + D_{31}) = -D_{11} + (B_{21} - A_{21}) + (B_{31} - A_{31}).$$


The error $m$ has three components: $(B_{21} - A_{21})$, which is the error introduced in the
matching operation (Section 5.2); $(B_{31} - A_{31})$, which is the error introduced by respondents
giving the wrong Census Day address (Section 5.3); and $-D_{11}$. $D_{11}$ has two components:
missing match status $m_i$ and fabrication $m_f$. Section 5.7 covers missing match status, and
Section 5.4 covers fabrication.
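
The decomposition of $m$ is an algebraic identity, which can be confirmed numerically. The sketch below uses hypothetical weighted cell counts (they are not TARO figures) and a helper name chosen for illustration. It relies only on the relations stated above: the enumerated cells in Tables 3 and 4 agree for reported, non-fabricated cases, and no fabricated case is matched.

```python
def match_count_error(d11, a21, a31, b21, b31, b11=0.0):
    """Return m computed directly and as the sum of its three components."""
    m_observed = b11 + b21 + b31          # matched total from Table 5
    m_error_free = d11 + a21 + a31        # enumerated total from Tables 3 and 4
    direct = m_observed - m_error_free
    by_component = -d11 + (b21 - a21) + (b31 - a31)
    return direct, by_component


# Hypothetical weighted counts, for illustration only.
direct, by_component = match_count_error(d11=40.0, a21=900.0, a31=60.0, b21=880.0, b31=45.0)
print(direct, by_component)  # -75.0 -75.0
```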

The ideal DSE can be written as follows: 


$$N_{1+} N_{+1}/N_{11} = (C - c)(N_p - n_p)/(M - m).$$
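
A small helper makes the sign convention concrete. This is a sketch under stated assumptions: the function name is hypothetical, the inputs are the site-level values from Section 2, and no post-stratification is applied, so the uncorrected value differs slightly from the reported estimate of 388,040. The error of -1,000 in the second call is an arbitrary illustrative value.

```python
def ideal_dse(c_obs, n_p_obs, m_obs, c=0.0, n_p=0.0, m=0.0):
    """Dual system estimate after removing the nonsampling errors c, n_p and m."""
    return (c_obs - c) * (n_p_obs - n_p) / (m_obs - m)


C, N_P, M = 343_567, 336_707, 298_204

uncorrected = ideal_dse(C, N_P, M)
# A negative m means too few matches were recorded; correcting for it
# raises the denominator and therefore lowers the population estimate.
corrected = ideal_dse(C, N_P, M, m=-1000.0)
print(round(uncorrected), round(corrected))
```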


5. COMPONENTS OF PES ERROR 


Estimates of the first two moments of the posterior distribution of the undercount rate derive 
from estimates of the first two moments of the components of PES error. The components 
are correlation bias, matching error, accuracy of the reported Census Day address, fabrica- 
tion in the P sample, measurement of erroneous enumerations, balancing the estimates of the 
gross overcount and the gross undercount, missing data, and sampling error. We next describe 
the source of each component of PES error and give models for each component. We model 
the component errors in terms of observable indicators of data quality. We estimate the first 
two moments of the distributions of the errors for use in the total error model in Section 6. 
5.1 Correlation Bias


5.1.1 Source of Error 


An important concern for dual system estimation is that the estimate of the proportion of 
the population enumerated in the census, based on the P sample, is accurate. The violation 
of one of the independence assumptions underlying dual system estimation may cause the 
estimate of the proportion of the population enumerated in the census, and thereby the estimate 
of the population, to be biased. 

Three independence assumptions are made for the dual system estimator:

Causality. The event of being included in the census is independent of the event of being included
in the PES. That is, the cross-product ratio satisfies


$$\theta_i = p_{i11}\,p_{i22}/(p_{i12}\,p_{i21}) = 1, \quad \text{for } i = 1, \ldots, N.$$


Homogeneity. The capture probabilities satisfy $p_{i1+} = p_{1+}$ or $p_{i+1} = p_{+1}$, for $i = 1, \ldots, N$,
within each of the post-strata.
Autonomy. The census and the PES are created as a result of N mutually independent trials. 

The homogeneity assumption follows combination model $M_{th}$ in Wolter (1986a). All the
development for the Petersen model $M_t$ in Wolter (1986a) also applies to model $M_{th}$ when
enough information is available to form post-strata where $M_t$ holds.

To control heterogeneity in the population the Census Bureau post-stratifies the data based
on demographic and geographic variables, a technique originally recommended by Sekar and 
Deming (1949). An estimate of the population in each post-stratum is calculated and then all 
the estimates are summed to give an estimate of the total population. Unless the failure of the 
homogeneity assumption is severe, the estimate lies between the census and the truth. 
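
The effect of post-stratification can be illustrated with a toy example. The two post-strata below are hypothetical and were chosen only so that their capture rates differ; the function name is likewise illustrative. Summing stratum-level estimates gives a larger total than pooling the data, which is the direction of bias expected when heterogeneous groups are combined.

```python
def dse(census_correct, p_sample_total, matched):
    """Dual system estimate for a single group."""
    return census_correct * p_sample_total / matched


# Hypothetical post-strata with different census capture rates.
strata = [
    {"C": 1200.0, "N_p": 1100.0, "M": 1000.0},
    {"C": 800.0, "N_p": 900.0, "M": 600.0},
]

post_stratified = sum(dse(s["C"], s["N_p"], s["M"]) for s in strata)
pooled = dse(
    sum(s["C"] for s in strata),
    sum(s["N_p"] for s in strata),
    sum(s["M"] for s in strata),
)
print(post_stratified, pooled)  # 2520.0 2500.0
```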

Research by Wolter (1986b) and Cowan and Malec (1986) has demonstrated that the failure 
of the autonomy assumption has a negligible effect on the bias of the DSE but causes an increase 
in its variance. Wolter’s formulation allows household members to act individually (autonomy) 
or together (failure of autonomy). Cowan and Malec present a model that permits clustering 
of the census misses (failure of autonomy). Next, we model the combined effect of the sources 
of correlation bias on the DSE. 


5.1.2 Definition 


For insight into the effect of correlation bias, assume all $\theta_i = \theta$ and write the true
population size as


$$N = N_{11} + N_{12} + N_{21} + \theta\,(N_{12}N_{21}/N_{11}),$$


where $\theta_i$ is the cross-product ratio defined in Section 5.1.1.

The correlation bias affects only the last term because the other three may be estimated
directly. The parameter $\theta$ represents the effect of the failure of the independence assumptions.
When the independence assumptions hold, $\theta = 1$.

The correlation bias, arising when $\theta$ does not equal 1, is the only contributor to $t$, the error
due to failure of the model. The population size can be written as follows:


$$N = N_{1+}N_{+1}/N_{11} + t = N_{1+}N_{+1}/N_{11} + (\theta - 1)(N_{12}N_{21}/N_{11}).$$


Therefore, the correlation bias is $t = (\theta - 1)(N_{12}N_{21}/N_{11})$.
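
To give a sense of scale, the correlation bias can be evaluated for several values of the cross-product ratio. The sketch below plugs in the site-level estimates from Section 2 as stand-ins for the cell counts, taking $N_{11}$ as M, $N_{12}$ as C - M, and $N_{21}$ as $N_p$ - M. That substitution and the function names are assumptions made for illustration, and the three ratios are the values attributed to Ericksen and Kadane (1985) in Section 5.1.4.

```python
def correlation_bias(n11, n12, n21, theta):
    """Correlation bias t for a given cross-product ratio theta."""
    return (theta - 1.0) * n12 * n21 / n11


def population(n11, n12, n21, theta):
    """Population size implied by the three observed cells and theta."""
    return n11 + n12 + n21 + theta * n12 * n21 / n11


# Site-level estimates used as stand-ins for the true cell counts.
C, N_P, M = 343_567, 336_707, 298_204
n11, n12, n21 = M, C - M, N_P - M

for theta in (1.0, 2.1, 2.7, 3.7):
    t = correlation_bias(n11, n12, n21, theta)
    print(theta, round(t), round(population(n11, n12, n21, theta)))
```

With theta equal to 1 the bias is zero and the population size reduces to the usual dual system form.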
5.1.3 Measurement


The parameter $\theta$ may be estimated at the national level for racial and ethnic subgroups using
demographic analysis estimates of the population size. Note, however, that this technique
presumes that the demographic analysis estimates are accurate. Even so, this formulation also
permits varying $\theta$ to assess the sensitivity of the DSE to the estimate of the effect of the
violation of the independence assumptions.


5.1.4 Estimation 


Estimates for $\theta$ were not made for the 1986 TARO because an alternate source for population
estimates did not exist, e.g., no demographic analysis estimates were feasible. However,
Ericksen and Kadane (1985) made three estimates of $\theta$ for blacks for the 1980 census: 2.1, 2.7,
and 3.7. Since the population in the 1986 TARO was predominantly minority (73 percent
Hispanic, 12 percent Asian, and 15 percent non-Asian and non-Hispanic), the Ericksen and
Kadane estimates for 1980 will be used in this paper: $E(\theta)$ = 2.1, 2.7, or 3.7, $Var(\theta) = 0$. We
are treating $\theta$ as fixed, but unknown. A sensitivity analysis is conducted in Section 6 to
demonstrate the effect of alternative values of $\theta$.

These estimates of $\theta$ are consistent with the reports of the participant observers in the Los
Angeles test site (Childers et al. 1987). Our professional judgment is that correlation bias is
higher for urban areas than for the country as a whole. This implies that these estimates may
be conservative for the Los Angeles test site because it was urban.


5.1.5 Summary 


In the total error model the first two moments of the posterior distribution of the correction
factor for correlation bias are assumed to be $E(\theta)$ = 2.1, 2.7, or 3.7, and $Var(\theta) = 0$.


5.2 Matching Error 


5.2.1 Source of Error 


Matching error in this discussion refers to errors that occur in the operation where the P 
sample is matched to the original enumeration. Therefore, matching error does not encompass 
response errors that arise in the data collection. Although other types of errors may result in 
an inaccurate assignment of a P-sample respondent’s census enumeration status, these sources 
are treated in other components of error. 

After the P-sample interviewing is completed, a search of the census is conducted to deter- 
mine if the respondents are enumerated. Then the P-sample respondents are designated as 
matching an enumeration in the census or as not enumerated in the census. Errors in assigning 
the enumeration status to P-sample persons which occur during the processing of the data are 
known as matching error. Errors may occur in either direction. People may be designated as 
matching a census enumeration although they are not in the census, called a "false match,"
or people may be designated as not enumerated although they are, called a "false nonmatch."
Matching error will cause a bias in the estimate of the number of people in both the census 
and the P-sample population and thereby introduce a bias into the estimates of the number 
of people missed by the census. 


5.2.2 Definition 


The denominator $N_{11}$ of the dual system estimator is estimated from sample survey data,
the P sample. The following were introduced in Section 4:


$A_{21}$ = the weighted number of people who were enumerated,
$B_{21}$ = the estimate of the number of people who match.

Then the net error due to incorrect classification of enumeration statuses, $m_m$, may be defined
as $m_m = B_{21} - A_{21}$. The conditional expected value and variance of $m_m$ given observed value
$M$ are denoted by $E(m_m)$ and $Var(m_m)$.


5.2.3 Measurement 


Measurement of $m_m$ is possible by processing a sample of the cases a second time, i.e., by
having highly trained personnel rematch them. The assumption underlying an independent re- 
match of a sample is that the personnel with more training make fewer mistakes in classifying 
enumeration statuses although they have the same materials and information available as the 
original workers. The original match codes and the evaluation match codes can be reconciled, 
and the discrepancies can be resolved. 
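
The way a rematch yields a net error can be sketched as follows. The example treats the rematch codes as correct, counts false matches as positive errors and false nonmatches as negative errors, and uses eight made-up cases; neither the data nor the function name comes from the TARO evaluation.

```python
def net_matching_error(original, rematch):
    """Net error in the match count, treating the rematch codes as correct.

    Each argument is a sequence of booleans (True = matched), one per case.
    """
    false_matches = sum(1 for o, r in zip(original, rematch) if o and not r)
    false_nonmatches = sum(1 for o, r in zip(original, rematch) if r and not o)
    return false_matches - false_nonmatches


# Hypothetical match codes for eight rematched cases.
original = [True, True, False, False, True, False, True, False]
rematch = [True, True, True, False, True, True, False, False]

print(net_matching_error(original, rematch))  # -1
```

A negative result means the original operation recorded too few matches, which is the direction of the net error reported for TARO.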

Two evaluations of the clerical matching were conducted with the 1986 TARO data. One 
study evaluated the clerical matching for movers, and another evaluated the clerical matching 
for nonmovers. 

In the evaluation of matching for nonmovers (Corby and Mulry 1988), a probability subsample
of 35 blocks was chosen for a rematch by professionals from headquarters. The sample
was stratified by match rate, and blocks with low match rates were sampled at
disproportionately high rates so that the quality control staff could learn as much as possible
about matching errors. Adjacent blocks were not searched, so the false nonmatches are possibly
underestimated.

The second evaluation study considered matching error for movers (Childers et al. 1987).
There were 90 movers who were not matched in TARO, and all of these movers were rematched. 
Eleven matches were found, two of which had been lost during the computer editing. 


5.2.4 Estimation 


We now use the results of the evaluation subsamples to estimate the moments of the
distribution of $m_m$ from the PES sample. Not conducting an extended search in the evaluation for
the nonmovers probably reduced the number of false nonmatches found. Experience with
extended searches implies that adding an additional 20 percent of the net error of 70 (Hogan
and Wolter 1988) is a conservative way to compensate for the lack of one. The results from
the two evaluations yield a net error of 95 in the PES sample. Therefore, the net error rate
is $-0.0055$. We apply the net error rate to only the P-sample cases with a resolved match status
because the error in the imputation for the unresolved cases is covered in the Missing Data
Section 5.7. The expected value of $m_m$ becomes $E(m_m) = -1831$ when the overall sampling
weight of 17 is used.

An estimate of the variance of the estimate of net matching error for nonmovers has not
been calculated. The sample variance of the number of errors for movers is zero because all
the nonmatched movers were rematched. However, we do not believe that the true variance
is zero. One way to obtain a variance specification would be to assume that the errors occurred
in the manner of a mixture of Poisson processes, e.g., matching errors for movers followed
one Poisson process and matching errors for nonmovers independently followed another
Poisson process. Treating the errors as arising from a simple Poisson process would then lead
to a conservative estimate of variance; in this case the variance would be estimated by 17 x 107.
However, the Poisson model may not be conservative if the errors occur in clusters. In an
attempt to develop conservative estimates of variance, we have (somewhat arbitrarily) multiplied
the variance estimate under the simple Poisson model by the overall sampling weight to obtain


$$Var(m_m) = (17)^2 \times 107 = 30,923.$$
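
The same scaling by the square of the overall sampling weight is used for each variance reported in Section 5. The sketch below applies it to the four PES-sample-level values given in Sections 5.2.4, 5.3.4, 5.4.4, and 5.5.4; the function name and dictionary labels are illustrative.

```python
WEIGHT = 17  # overall sampling weight


def site_level_variance(sample_level_variance, weight=WEIGHT):
    """Scale a PES-sample-level variance to the TARO site level."""
    return weight ** 2 * sample_level_variance


sample_level = {
    "matching error (5.2.4)": 107,
    "Census Day address (5.3.4)": 900,
    "fabrication (5.4.4)": 206,
    "erroneous enumerations (5.5.4)": 14,
}

for label, value in sample_level.items():
    print(label, site_level_variance(value))
# matching error (5.2.4) 30923
# Census Day address (5.3.4) 260100
# fabrication (5.4.4) 59534
# erroneous enumerations (5.5.4) 4046
```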
5.2.5 Summary


For the total error model, the first two moments of the posterior distribution of the net
matching error for the PES sample are assumed to be $E(m_m) = -1831$ and
$Var(m_m) = 30,923$.


5.3 Quality of the Reported Census Day Address


5.3.1 Source of Error 


Some of the respondents in the P sample have moved between Census Day and their PES 
interview. The respondents may misreport whether they have moved during the time lapse. 
If they have moved, they may not report their previous address accurately, or their previous 
address may not be geocoded correctly by the staff. Any of these types of errors may cause 
the matching operation to search the census in an area other than where the respondent was 
enumerated. These errors may lead to assigning a nonmatch status to respondents who actually 
were enumerated because the matching operation is unable to locate their enumerations. Inap- 
propriate assignment of the status of nonmatch will cause the estimate of the number of people 
missed by the census to be biased upward. 

Circumstances under which inaccurate reporting of the Census Day address by a PES respon- 
dent will not cause a false nonmatch do exist. If the Census Day address is inside the search 
area for the reported address, and the reported address is geocoded correctly, then the match- 
ing operation will find the person. 


5.3.2 Definition 


The denominator $N_{11}$ of the dual system estimator is estimated from sample survey data,
the P sample. The following were introduced in Section 4:


$A_{31}$ = the weighted number of people with an inaccurate Census Day address who are
enumerated,
$B_{31}$ = the estimate of the number of people with an inaccurate Census Day address who
match at another address.


Then the net error due to inaccurate reporting of the Census Day address, $m_a$, may be
defined as $m_a = B_{31} - A_{31}$. The conditional expected value and variance of $m_a$ given the
observed value $M$ are denoted by $E(m_a)$ and $Var(m_a)$.


5.3.3 Measurement 


Measurement of $m_a$ is based on a follow-up of a sample of P-sample respondents whose
enumeration status is "not enumerated". Data from the follow-up are used to estimate the
error that arises when people who were enumerated misreport their Census Day address when 
they respond to the PES. 

An evaluation of the quality of the reporting of the Census Day address was conducted after 
the 1986 TARO. A post-production follow-up which reinterviewed a sample of 903 of the non- 
matches was aimed at determining the number of nonmatches caused by misreporting mover 
status. Another search to match respondents who reported they in fact had moved within the 
test site was made at the new address. 


5.3.4 Estimation 


The sample cases found to have errors in their reported Census Day address may be used
to estimate

$L_a$ = the weighted number of people who erroneously report their Census Day address in
their P-sample interview.

A search of census enumerations at the newly reported addresses produces


$r_{am}$ = the estimator of the percentage of people with errors in the location of their reported
Census Day address who match census enumerations.


Then the expected value of the error $m_a$ is estimated by

$$E(m_a) = -L_a\,r_{am}.$$


The results of the post-production follow-up (Hogan and Wolter 1988) yielded a misreporting
rate of at most 3.1 percent in the P sample. A match rate of 33 percent was estimated for
those who misreported their Census Day address and moved within the test site. If we assume
the match rate for those who reported a Census Day address outside the test site is also
33 percent, then the expected value is $E(m_a) = -3481$.

An estimate of the variance of the error due to misreporting has not been made. Our
professional judgment is that a conservative estimate of the variance at the PES sample level is
900. Therefore, the variance at the TARO site level is


$$Var(m_a) = (17)^2 \times 900 = 260,100.$$


5.3.5 Summary 


For the total error model, the first two moments of the distribution of the error due to
misreporting of Census Day address for the PES sample are assumed to be $E(m_a) = -3481$
and $Var(m_a) = 260,100$.


5.4 Fabrication in the P sample 


5.4.1 Source of Error 


Interviewers may fabricate people in P-sample housing units. Research has shown that 
interviewer fabrication during the PES may result in a substantial bias in the estimates of census 
coverage error based on the dual system estimator. Basically, the creation of fictitious indi- 
viduals may decrease the PES match rate, causing the estimate of coverage error to be too large. 

Experience at the Bureau of the Census has shown that fabrication of the members of a whole 
household is the main problem for household surveys. Rarely is a single household
member fabricated in a household where the other members are the real residents.

The quality control operation for the interviewing phase of the P sample is designed to check 
for fabricated interviews and to interview the real household members. Therefore, no statistical 
correction for fabrication in the P sample is made in the formation of the dual system estimates. 


5.4.2 Definition 
The $N_{11}$ and $N_{+1}$ in the dual system estimator are estimated from sample survey data, the
P sample. The following were introduced in Section 4:
$m_f$ = the weighted number of people who were replaced by fabricated P-sample interviews
and who were enumerated,
$n_{pf}$ = the error in $N_p$ due to households that were fabricated in the P sample.


The posterior expected values and variances of $m_f$ and $n_{pf}$ are denoted by $E(m_f)$ and $E(n_{pf})$
and $Var(m_f)$ and $Var(n_{pf})$.


5.4.3 Measurement 


In the 1986 TARO, the estimate of the fabrication rate based on the quality control of the 
interviewing was approximately 0.6 percent. The estimate of the fabrication rate based on a 
post-production follow-up was approximately 1.2 percent (Hogan and Wolter 1988). 
5.4.4 Estimation


We now estimate the moments of the posterior distributions of $n_{pf}$ and $m_f$ from the PES
sample. We believe it is reasonable to assume $n_{pf}$ is negligible in TARO. Therefore, the
expected value and variance are given by $E(n_{pf}) = 0$ and $Var(n_{pf}) = 0$.

The quality control data may be used to estimate $r_f$ = the rate at which P-sample interviews
are fabricated.

The search of the census enumerations for people in the P sample who were found by the
quality control operation to not have been properly interviewed produces $r_{fm}$ = the match
rate for people not interviewed because their household was fabricated in the P sample.

In TARO, records were not kept so that the people who were discovered by the quality
control not to have been interviewed properly could be identified. Therefore, no search was made
for matching enumerations. Since we have no data available for a direct estimate of $r_{fm}$, we
conservatively assume that the people not interviewed properly are like the people who were.
We set $r_{fm}$ equal to the final overall P-sample match rate.

We use the conservative results from the post-production follow-up to yield a fabrication
rate of 1.2 percent. The match rate for TARO is 88.6 percent (Diffendal 1988). Therefore, the
expected value of the error $m_f$ is given by $E(m_f) = -2502$.

An estimate of the variance of the estimate of fabrication error has not been calculated.
Our professional judgment is that a conservative estimate of the variance can be derived by
the reasoning discussed in Section 5.2.4. Thus, we estimate that the variance for the TARO
site is


$$Var(m_f) = (17)^2 \times 206 = 59,534.$$


5.4.5 Summary 


For the total error model, the first two moments of the distribution of the net error due
to fabricated interviews are assumed to be $E(m_f) = -2502$ and $Var(m_f) = 59,534$. The net
error due to fabricated interviews in $N_p$ is assumed to be negligible, and therefore, $E(n_{pf}) = 0$
and $Var(n_{pf}) = 0$.


5.5 Measurement of Erroneous Enumerations 


5.5.1 Source of Error 


Some enumerations may have been entered in the census as the result of mistakes. These 
enumerations are called erroneous enumerations. Since the dual system estimator requires 
estimating the number of distinct people captured in the census, a correction is made for 
erroneous enumerations in the estimate of total population. Subtracting the estimate of the 
number of enumerations that do not correspond to distinct people from the census count pro- 
vides an improved estimate of the number of distinct people captured in the census. This 
estimated correction is obtained from the E sample in the PES. 

The following types of enumerations are considered erroneous: (1) people who died before 
Census Day, (2) people who were born after Census Day, (3) enumerations that do not refer 
to real people, (4) people duplicated, (5) people enumerated outside the search area where the 
matching operation looks for their enumeration. The search area for a case includes the block 
for its address and the ring of adjacent blocks. 

This component is caused by errors in measuring census error. An error in the estimation
of the number of erroneous enumerations occurs either when an enumeration in the E sample
is designated as erroneous although it is correct, or when an enumeration is designated as
correct although it is really erroneous. Therefore, both positive and negative error can occur in
the estimation of the number of erroneous enumerations.

The types of enumerations that are the most vulnerable to misclassification as to whether
they are erroneous include the duplicated and fabricated enumerations. These errors are the
only ones considered because the others are either inconsequential or are treated separately.
Errors in identifying enumerations for people who died before Census Day and people who
were born after Census Day have a trivial effect. Errors in classifying the enumeration status
because a person was enumerated outside the search area are covered in Section 5.6 on
balancing the estimates of the gross overcount and the gross undercount.


5.5.2 Definition 


The bias in the DSE due to misclassification of enumeration status is caused by error in the
estimation of $N_{1+}$. In the formation of the estimate of the number of distinct people in the
original enumeration $C$, a correction is made for the number of erroneous enumerations, EE.
EE and therefore $C$ are estimated from sample survey data, the E sample. Errors in the estimate
$C$ occur through the misclassification of the enumeration status of E-sample cases. Let


$c_e$ = the difference between the weighted number of erroneous enumerations misclassified
as correct and the weighted number of correct enumerations misclassified as erroneous.


The expected value of $c_e$, conditional on the observed value $C$, is denoted by $E(c_e)$. The
variance of $c_e$, conditional on the observed value $C$, is denoted by $Var(c_e)$.


5.5.3 Measurement 


Processing error may be measured directly using a rematch of a sample of cases. Errors from 
other sources, such as duplications due to violations of census residency rules, can be assessed 
by viewing the frequency distributions of the erroneous enumerations. This is preferable to 
direct measurement of these errors because of the difficulties in obtaining accurate data in addi- 
tional follow-ups. When tests confirm that the gross errors from these sources are under con- 
trol, the net error can be assumed to be negligible. For example, the distribution of the erroneous 
enumerations by age group is expected to have a large number of duplications in the highly- 
mobile groups of the population where there are more opportunities for the census residency 
rules not to be followed. 

In the 1986 TARO, an evaluation of the E-sample processing was conducted in conjunc- 
tion with the evaluation of the P-sample matching operation discussed in Section 5.2.3 (Corby 
and Mulry 1988). The data for the E sample from the same subsample of 35 blocks were 
reprocessed. 


5.5.4 Estimation 


We now estimate the moments of the distribution of $c_e$ from the PES sample. The results
of the reprocessing (Hogan and Wolter 1988) yield a net error rate of 0.0007 in the
identification of correct enumerations. The expected value of $c_e$ is $E(c_e) = -238$. This estimate is
based on the E sample with a resolved enumeration status because the error in the imputation
for the unresolved cases is covered in the Missing Data Section 5.7.

An estimate of the variance of net error has not been calculated. Our professional
judgment is that a conservative estimate of the variance can be derived by the reasoning
discussed in Section 5.2.4. Thus, we estimate that the variance for the TARO site is

$$Var(c_e) = (17)^2 \times 14 = 4,046.$$


5.5.5 Summary


For the total error model, the first two moments of the posterior distribution of the net error
in identifying correct enumerations are assumed to be $E(c_e) = -238$ and $Var(c_e) = 4,046$.


5.6 Balancing the Estimates of the Gross Overcount and Undercount 


5.6.1 Source of Error 


Both the E sample and the P sample measure enumeration errors in the census. The E sample 
measures the gross overcount in the form of erroneous enumerations. The P sample measures 
the gross undercount in the form of those not enumerated. Ideally, the entire census would 
be searched before a P-sample person was declared to be not enumerated. Ideally, the entire 
country would be searched to determine if an E-sample enumeration is a duplicate or fictitious. 
Of course, such extensive searches are simply not feasible in the performance of the PES. These 
searches must be limited in a reasonable manner. The way chosen has to preserve the net
error although the measured gross overcount and the measured gross undercount may increase 
due to limiting the search area. The gross overcount and the gross undercount have to balance 
to equal the net coverage error. 

Failure to have procedures which balance the estimated gross overcount and the estimated 
gross undercount may cause an incorrect number of enumerations in the E sample to be 
designated as erroneous when they are correct. This error may cause either an upward or 
downward bias. 

Balancing is not an issue for the design of the PES planned for 1990 and tested in the 1986 
TARO, as it was in 1980. The design calls for overlapping the P sample and the E sample. The 
same blocks are included in the P sample as in the E sample. The P-sample search area is, by 
definition, the proper search area. The E-sample search area is chosen to be consistent with 
the P-sample search area. 


5.6.2 Summary 


Error due to geocoding error is believed to be negligible in the 1986 TARO and will not be 
included in the total error model. The appendix contains a model for balancing error. 


5.7 Missing Data 


5.7.1 Source of Error 


Both the E sample and the P sample have missing data. The E sample has cases where the 
information required to determine whether the person is correctly or erroneously enumerated 
in the census is not available. The P sample has cases where the information needed to deter- 
mine whether the person is enumerated in the census is not available. The probability of being 
enumerated is imputed statistically to compensate for the inability to resolve the case.

An unresolved status may occur in more than one way. The interviewer may be unable to
obtain an interview during the P-sample interviewing or during the PES follow-up. A P-sample 
or E-sample questionnaire may not have all the demographic and housing information required 
for the estimation. Even with all the information requested on the questionnaires, the
circumstances may be so unclear that the enumeration status cannot be resolved.


5.7.2 Measurement 


We assess the error in the DSE caused by missing data instead of considering each component
$c_i$, $m_i$, and $n_{pi}$ separately. Our approach is to perform a sensitivity analysis of reasonable
alternative models for compensating for missing data. First a preferred method of imputation for
unresolved P-sample and E-sample enumeration statuses is specified prior to the implementation
of the PES. Reasonable alternative treatments of the missing data can be suggested by
problems that arise during the collection and processing of the PES data. The DSE can be com- 
puted under these alternative models for compensating for missing data. The range of the alter- 
native estimates indicates the sensitivity of the DSE to the method of imputation. For example, 
a narrow range implies that the estimates are robust, and the missing data cause little uncer- 
tainty in the estimates. 


5.7.3 Estimation 


The effect of missing data on the estimates from the 1986 TARO was assessed by examining 
the range of estimates obtained when methods of imputation based on reasonable alternative 
assumptions were used in place of the preferred method. These included alternative treatment 
of proxy responses, movers, and designation of fictitious enumerations (Schenker 1988). The 
alternative treatment of the proxy interviews for P-sample cases classified them as noninter- 
views and applied the weighting adjustment. This essentially assigned proxy cases the same 
match rate as nonproxy cases. The alternative treatment of the P-sample movers reclassified 
them all as unresolved and imputed a match probability, instead of imputing for only those 
who were not resolved. This essentially assigned movers the same match rate as nonmovers. 
The alternative treatment of fictitious cases resulted from a review of the unresolved E-sample 
cases by experienced matching personnel who converted some unresolved cases to fictitious. 
This raised both the observed and imputed rates of erroneous enumeration. 

Models 000 and 111 shown in Table 4 of Schenker’s paper give the upper and lower bounds 
of the estimates of undercount rates, respectively. Both models differ from TARO in that they 
have inmovers as substitutes for outmovers. P-sample inmovers are P-sample respondents who 
moved into their housing unit between Census Day and PES interviewing. In the 1986 TARO 
the P-sample inmovers from areas outside the test site were omitted from the PES estimation. 
The omission of the outmovers from estimation essentially assumes that they had the same 
capture rate in the original enumeration as the included cases. Movers are believed to have a 
lower capture rate than nonmovers. Model 000 has the TARO treatments while Model 111 has 
all the alternative treatments. 


5.7.4 Summary 


The effect of missing data on the distribution of the total error is assessed by computing 
the distribution of the undercount rate under several reasonable imputation methods. The alter- 
native methods which yield the upper and lower bounds for the undercount are used in the 
total error analysis. 


5.8 Sampling Error 


5.8.1 Source of Error 


The observed DSE is subject to sampling error because $N_p$, $C$, and $M$ are estimated from 
samples. The sample size for the PES is determined by the amount of sampling error and budget 
allowable. Other things being equal, the larger the sample size, the lower the amount of sampling 
error introduced in the estimates. The sampling error is affected by the estimator and the 
sampling design. In the TARO PES design, both the P-sample and the E-sample observations 
are collected from the same sample of blocks. All the people residing in the housing units in 
the selected blocks are included in the P sample. All enumerations assigned by the census process 
to the sample block are included in the E sample. The estimation of the sampling error takes 
into account the tendency for census misses and erroneous enumerations to be correlated within 
blocks and within housing units. Experience has shown that many hard-to-enumerate areas 
have both a higher rate of omissions and a higher rate of erroneous enumerations. 


5.8.2 Measurement 

The standard randomization theory model for survey sampling is appropriate for estimating 
the variance of the DSE. The coefficient of variation, which is the ratio of the square root of 
the variance of the observed DSE to the mean of the distribution of the DSE, provides 
information on the amount of sampling error in the DSE. 

The Taylor series estimator of variance for the observed dual system estimator (Moriarity 
1987), $v(N_{++})$, is given by 

$$
\begin{aligned}
v(N_{++}) = {} & N_{++}^2 \left( \frac{v(N_p)}{N_p^2} + \frac{v(M)}{M^2} - \frac{2\,c(N_p, M)}{N_p M} \right) \\
& + \frac{N_p^2\, v(E)}{M^2} + 2 N_{++} \left( \frac{N_p\, c(E, M)}{M^2} - \frac{c(E, N_p)}{M} \right),
\end{aligned}
$$

where 

$E = II + EE$, 
$v(X)$ = the estimator of the variance of an estimator $X$, 
$c(X, Y)$ = the estimator of the covariance between $X$ and $Y$. 


The categories $II$, insufficient information for matching, and $EE$, erroneous enumerations, 
are treated as one group in the variance estimation. The variance and covariance 
estimators reflect the cluster sampling of blocks and block clusters. 
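
The Taylor series estimator can be evaluated mechanically once the component variances and covariances are available. The following Python sketch is illustrative only: it assumes the DSE has the form $N_p (C - E)/M$ (the form whose linearization yields the variance expression above), and the function name, argument names, and numeric inputs are all hypothetical, since the component variances for TARO are not reported here.

```python
def dse_taylor_variance(n_p, c, e, m, v_np, v_m, v_e,
                        cov_np_m=0.0, cov_e_m=0.0, cov_e_np=0.0):
    """Return (DSE, Taylor series variance) for N = N_p * (C - E) / M.

    v_* are estimated variances and cov_* estimated covariances of the
    sample-based components; all names here are illustrative.
    """
    n_hat = n_p * (c - e) / m
    variance = (
        n_hat ** 2 * (v_np / n_p ** 2 + v_m / m ** 2
                      - 2.0 * cov_np_m / (n_p * m))
        + n_p ** 2 * v_e / m ** 2
        + 2.0 * n_hat * (n_p * cov_e_m / m ** 2 - cov_e_np / m)
    )
    return n_hat, variance


# Hypothetical inputs, chosen only so the arithmetic is easy to follow.
example_dse, example_var = dse_taylor_variance(
    n_p=1000.0, c=1100.0, e=100.0, m=800.0,
    v_np=400.0, v_m=256.0, v_e=64.0)
example_cv = example_var ** 0.5 / example_dse
```

With zero covariances the expression reduces to the familiar sum of relative variances of $N_p$ and $M$, scaled by the squared DSE, plus the contribution of $E$.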


5.8.3 Estimation 

The standard deviation of the dual system estimate of 388,040 for the TARO site is 3,100.37. 
The coefficient of variation is 0.008. This implies the standard deviation for the estimated net 
undercount rate is 0.7 percent. 


5.8.4 Summary 


The sampling error for the TARO DSE is 3,100.37, and the sampling error for the TARO 
net undercount rate estimate is 0.70 percent. 


6. SYNTHESIS OF TOTAL ERROR 


The combined effect of the component errors will be summarized by posterior distributions 
for the net undercount rate. The bias in the estimate of net undercount rate, B(U), is estimated 
by the difference between the estimate and the mean of the posterior distribution. To construct 
the posterior distribution, we used a simulation method with 10,000 repetitions, generating 
pseudo-random component errors and adding them to the TARO estimates. Using the formulas in 
Section 5.1.2, we obtain the following formula: 


$$
\begin{aligned}
N &= (N_p - n_p) + (C - c - (M - m)) + \theta\,\frac{(C - c - (M - m))(N_p - n_p - (M - m))}{M - m} \\
&= \frac{(C - c)(N_p - n_p)}{M - m} + (\theta - 1)\,\frac{(C - c - (M - m))(N_p - n_p - (M - m))}{M - m}.
\end{aligned}
$$
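
The simulation described above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' program: it perturbs only the error in $M$, draws it from a normal distribution, and uses function and variable names that are our own. The site totals are those quoted in the text for the TARO site, with 298,204 read as $M$ (an assumption, since the symbol is garbled in the source), and the mean and standard deviation in the example call are the net matching error values from Table 6.

```python
import random

# Site-level totals quoted in the text; reading 298,204 as M is an assumption.
N_P, C_CAP, M_CAP = 336_707.0, 343_567.0, 298_204.0


def corrected_dse(n_p, c_cap, m_cap, err_n=0.0, err_c=0.0, err_m=0.0, theta=1.0):
    """First form of the formula: sum of the four cells of the 2 x 2 table."""
    a = n_p - err_n      # error-corrected P-sample total
    b = c_cap - err_c    # error-corrected census total
    d = m_cap - err_m    # error-corrected matches
    return a + (b - d) + theta * (b - d) * (a - d) / d


def corrected_dse_alt(n_p, c_cap, m_cap, err_n=0.0, err_c=0.0, err_m=0.0, theta=1.0):
    """Second form: the usual DSE plus a correlation-bias term."""
    a, b, d = n_p - err_n, c_cap - err_c, m_cap - err_m
    return a * b / d + (theta - 1.0) * (b - d) * (a - d) / d


def simulate_mean(err_m_mean, err_m_sd, theta=1.0, reps=10_000, seed=12345):
    """Average corrected total over pseudo-random normal draws of the error in M."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        draw = rng.gauss(err_m_mean, err_m_sd)
        total += corrected_dse(N_P, C_CAP, M_CAP, err_m=draw, theta=theta)
    return total / reps


# Net matching error alone (mean -1831, standard deviation 176 from Table 6).
mean_with_matching_error = simulate_mean(-1831.0, 176.0)
```

Because the matching error is negative, subtracting it enlarges the corrected match count and lowers the corrected population total, which is why matching error alone makes the bias in the undercount estimate positive.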


Several different distributions were used to reflect alternative estimates of imputation error, 
alternative estimates of correlation bias (parameterized by $\theta$), and alternative marginal 
distributional forms for the components: normal, gamma, and uniform. 

In this study, the estimate of percent net undercount for the TARO site is 8.42 with a sampling 
standard deviation of 0.7. This estimate was selected because estimates of nonsampling error 
components are available only for the site as a whole. When a DSE is constructed for each 
post-stratum and then the DSEs are summed to give an estimate for the site, the percent net 
undercount estimate is 9.02. 

Table 6 displays the means and standard deviations of the error components for the PES 
sample. Recall that the DSE for the TARO site is 388,040, $M$ = 298,204, $C$ = 343,567, and 
$N_p$ = 336,707. The overall sampling weight, 17, was used consistently throughout all the 
simulations so that comparisons of the effect of alternative assumptions such as correlation 
bias parameter values, error distributions, and imputation models are appropriate. The 
methodology generalizes to other applications where a different sampling weight is used in each 
stratum. 

Table 7 displays the effects of the individual errors on the posterior distribution of the 
undercount when the TARO imputation is used. The net matching, Census Day address, and 
fabrication errors are all errors in $M$. Therefore, the presence of any one of them alone causes 
the bias in the estimate of percent net undercount to be positive. The net E-sample error is an 
error in $C$. The presence of E-sample error alone causes the bias in the estimate of percent net 
undercount to be negative. The estimate of the correlation bias parameter, $\theta$, was chosen to 
be 2.7, the median of Ericksen and Kadane’s estimates. The presence of only correlation bias 
causes the bias in the percent net undercount estimate to be negative. 


Table 6 
Assumed Distributions of Error Estimates 


                          Mean    Standard 
                                  Deviation 

Net Matching Error       -1831      176 
Census Address Error     -3481      510 
Fabrication Error        -2502      244 
Net E sample Error        -238       64 


Table 7 
Individual Effects of Errors on Posterior Distribution 
of Percent Net Undercount and Bias 
in the Estimate of Undercount 

                     E(U)    Std. Dev.    B(U) 
Net Matching         7.86      0.06       0.56 
Census Address       7.35      0.16       1.07 
Fabrication          7.34      0.08       1.08 
Net E sample         8.49      0.02      -0.07 
Correlation 
  Bias (2.7)        10.61      0.00      -2.19 


Mulry and Spencer: Total Error in the Dual System Estimator 




[Figure 1. Percent Undercount when $\theta$ = 2.7. Histogram; vertical axis: Frequency; horizontal axis: Percent Undercount.] 


Table 8 

Percentiles of the Posterior Distribution of 
Percent Net Undercount for $\theta$ = 2.7 

              1      5      10     25     50     75     90            95     99 

Normal      6.70   6.86   6.94   7.08   7.24   7.40   7.54          7.63   [illegible] 
Uniform     6.75   6.86   6.93   7.07   7.24   7.42   [illegible]   7.62   [illegible] 
Gamma       6.67   6.84   6.93   7.08   7.24   7.40   [illegible]   7.64   7.74 


Table 9 

Posterior Distribution of the Net 
Undercount Rate for Several Values of $\theta$ 

$\theta$     E(U)    St. Dev.    B(U) 
1.0          5.75      0.18      2.67 
2.1          6.72      0.22      1.70 
2.7          7.24      0.23      1.18 
3.7          8.09      0.27      0.33 


Simulations were conducted where the first two moments of the component errors in $N_p$, $C$, 
and $M$, and of $\theta$, were held constant, but the distributions were varied. We assessed the 
total error when all the error distributions were normal, all were gamma, and all were uniform. 
Varying the distributions had minor effects on the distribution of the percent net undercount. 
In each case the distribution of the percent net undercount was very close to normal. Figure 1 
shows the distribution of the undercount when $\theta$ = 2.7, and it is illustrative of the results 
of the simulations. 

Table 8 shows the percentiles of the distribution of the net undercount rate for different 
distributions for the component errors when $\theta$ is taken to be 2.7 and the TARO imputation 
is used. The standard deviation for the posterior distribution was 0.23. In all the cases, a normal 
distribution is an adequate approximation. The percentiles differed by at most 0.02 for the 
percentiles between 5 and 95. The 1 and 99 percentiles differed by at most 0.08. 

Varying the value of the estimate of $\theta$ for the correlation bias did affect the moments of 
the posterior distribution of the undercount. The variation appears in the mean and in the 
standard deviation. Table 9 shows the results for the different values of $\theta$, where the 
distributions for the errors are normal. The case where $\theta$ = 1 portrays virtually no 
correlation bias, while the other sources of error are present. In the cases where $\theta$ = 2.1, 
2.7, and 3.7, all the sources of error are taken into account. The distribution of the undercount 
shifts to the right as the estimate of $\theta$ for the correlation bias increases. The variance also 
increases as the estimate of $\theta$ increases. For all values of $\theta$ considered, the bias B(U) 
is positive although it decreases as $\theta$ increases. 

The simulations were conducted with reasonable alternative models for the imputation for 
unresolved match status. Although there was some variation in the first two moments of the 
distribution of the net undercount rate, the estimate of net undercount rate in TARO appears 
robust to missing data. Table 10 illustrates the results of the simulations using models 000 and 
111 described in Section 5.7.3. Models 000 and 111 yielded the upper and lower bounds of the 
undercount estimates under all the reasonable alternative imputation models. The bias in the 
estimate of the percent net undercount rate ranges from 0.93 to 2.79. In other words, the bias 
is between 11 percent and 33 percent of the net undercount rate estimate of 8.42. Varying the 
imputation model has almost no effect on the standard deviation. 




Table 10 

Posterior Distribution of the Percent 
Net Undercount Under Reasonable Alternative 
Imputation Models When $\theta$ = 2.7 

              E(U)    St. Dev.    B(U) 
TARO          7.24      0.23      1.18 
Model 000     7.49      0.23      0.93 
Model 111     5.63      0.22      2.79 


The total variance of the estimated net undercount rate may be estimated by the sum of the 
sampling variance and the nonsampling variance. For the case where $\theta$ = 2.7, the standard 
deviation shown in Table 10 for both models 000 and 111 is 0.22, which translates to a 
nonsampling variance of about 0.05 when all errors are considered. The standard deviation of the 
estimate of net undercount rate is 0.70, which translates to a sampling variance of 0.49. 
Therefore, the total variance is about 0.54 and the standard error is 0.73. The coefficient of 
variation of the net undercount rate is 0.083. The nonsampling variance contributes very little 
to the total variance relative to the contribution by the sampling variance. 
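
The arithmetic behind the total standard error can be checked directly. The short Python sketch below works in percentage points (so variances are in squared percentage points) and adds the two variances under the assumption that sampling and nonsampling errors are uncorrelated; the function name is ours.

```python
import math


def total_standard_error(sampling_sd, nonsampling_sd):
    """Combine two uncorrelated error sources; returns (variance, std. error)."""
    total_variance = sampling_sd ** 2 + nonsampling_sd ** 2
    return total_variance, math.sqrt(total_variance)


# Standard deviations quoted in the text, in percentage points.
total_variance, total_se = total_standard_error(sampling_sd=0.70,
                                                nonsampling_sd=0.22)
sampling_share = 0.70 ** 2 / total_variance
```

On these inputs the sampling variance accounts for roughly nine tenths of the total, which is the sense in which the nonsampling variance contributes little.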


7. CONCLUSIONS 


When the post-stratification is used in the estimation, the undercount estimate for TARO 
is 9.02. The post-stratification increased the net undercount rate estimate by 0.6, which is less 
than one standard deviation of 0.73 from the estimate of 8.42. Although we expect the error 
in the post-stratified estimate is smaller, the result is consistent with the error analysis. 

As we consider all the sources of error in the posterior distribution of the net undercount 
rate, we do not know the distribution of the correlation-bias parameter $\theta$. Although we could 
assume a prior distribution for $\theta$, others might disagree. If we were certain that $\theta$ is 
2.7, then our 95 percent confidence interval for the net undercount rate would be 


(4.77, 9.55). 


We calculate this by taking the post-stratified estimate 9.02 and adjusting for the two bias 
estimates in Table 10, 2.79 and 0.93, and two standard deviations, 2 x 0.73. We feel this is 
a conservative estimate since we use two different bias estimates from imputation models 000 
and 111. A very conservative 95 percent confidence interval for U for any value of $\theta$ between 
2.1 and 3.7 is (4.43, 10.32). 
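
The interval for $\theta$ = 2.7 follows from the quantities named in the preceding paragraph: the larger bias estimate and two standard deviations are subtracted for the lower limit, and the smaller bias estimate is subtracted and two standard deviations added for the upper limit. A minimal Python sketch of that calculation (the function name is ours):

```python
def conservative_interval(estimate, bias_small, bias_large, sd, k=2.0):
    """Bias-adjusted interval using the extreme bias estimates and k std. devs."""
    lower = estimate - bias_large - k * sd
    upper = estimate - bias_small + k * sd
    return lower, upper


# Post-stratified estimate 9.02, Table 10 biases 0.93 and 2.79, sd 0.73.
lower_limit, upper_limit = conservative_interval(9.02, 0.93, 2.79, 0.73)
```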

We believe the methodology described in this paper is applicable in the 1990 census with 
appropriate modifications. Areas for further research are nonsampling error estimates for 
post-strata, a distribution for the correlation-bias parameter, and models for address reporting 
error. 


APPENDIX 


Definition of Balancing Error 

The non-linearity of the dual system estimator makes an additive model inadequate for 
viewing the technical implications of the balancing of the estimated gross overcount and the 
gross undercount. Therefore, a more appropriate multiplicative model is developed in this 
section. 

Limiting the E-sample and the P-sample search areas affects two parts of the DSE. One 
effect is a bias in the estimate of the number of erroneous enumerations, $EE$. The other is 
a bias in the estimate of the number of people in both the census and the P-sample population, 
$M$. 

The following definitions are needed for examining the effects of limiting the E-sample and 
the P-sample search areas in the TARO design on the dual system estimate: 

$b$ = the proportion of the correct census enumerations that are in their P-sample search area. 

$g$ = the ratio of the number of correct census enumerations that are in their E-sample search 
area to the number that are in their P-sample search area. 

The proportion g reflects error in the implementation of the survey committed when the 
E-sample search area is not equal to the P-sample search area. The way TARO was executed 
implies g = 1. To show what would happen if g does not equal 1, we will carry g through the 
discussion. 

The limiting of the search area causes only a percentage $b$ of the P-sample people who are 
in both the census and the P-sample population to be designated as matching a census 
enumeration. Under these circumstances, a systematic bias equal to $(1 - b)N_{11}$ is introduced 
into the estimation of the number of people in both the census and the P-sample population. 
Therefore, the observed $M$ really estimates $bN_{11}$. 

Likewise, the limiting of the search area causes only a percentage $b$ of the census 
enumerations to be available to be designated as correct. Then only a percentage $g$ of those, the 
ones whose search areas are consistent with the proper E-sample search areas, will be designated 
as correct. Under these circumstances a systematic bias equal to $(1 - bg)N_{+1}$ is introduced 
into the estimation of the number of distinct people in the census. This bias occurs in the 
estimation of the number of erroneous enumerations, $EE$. With this formulation, the observed 
number of distinct people in the census really estimates $bgN_{+1}$. 


If $g = 1$, no systematic bias is present in the estimation of the dual system estimate because 

$$ \frac{b g N_{+1} N_{1+}}{b N_{11}} = \frac{N_{1+} N_{+1}}{N_{11}}. $$

The error in the estimation of $N_{++}$ due to the failure to balance may be defined by 


$c_b$ = the error in the number of erroneous enumerations due to the failure to define the 
E-sample search areas consistently with the P-sample search areas. 

The error $c_b$ would be nonzero if $g$ does not equal 1. The ratio $g$ may be greater than or less 
than 1. The error is given by $c_b = b(g - 1)N_{++}$. 
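
The cancellation of $b$ in the multiplicative model can be illustrated numerically. In the Python sketch below, the function name and all cell counts are hypothetical; the point is only that limiting both search areas by the same proportion $b$ leaves the dual system estimate unchanged, while a ratio $g$ different from 1 scales it by $g$.

```python
def dse_with_limited_search(n_11, n_1plus, n_plus1, b, g):
    """DSE computed from the quantities actually observed under limited search."""
    observed_matches = b * n_11          # what the observed M estimates
    observed_correct = b * g * n_plus1   # what the observed census count estimates
    return observed_correct * n_1plus / observed_matches


# Hypothetical cell counts, for illustration only.
unbiased = 900.0 * 950.0 / 800.0
balanced = dse_with_limited_search(800.0, 900.0, 950.0, b=0.9, g=1.0)
unbalanced = dse_with_limited_search(800.0, 900.0, 950.0, b=0.9, g=0.98)
```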


Measurement 


In TARO, $c_b$ was evaluated by testing to confirm that balancing was not an issue and that 
the design was under control. The percentage of matching enumerations found within the 
sample block was large, which implies that the design was under control. Since the design was 
under control, $g$ is assumed to be approximately 1, and $c_b$ is assumed to be negligible. 


Estimation 


The geocoding appeared to be very good in the TARO test site. However, no formal 
measurement of the effects of any misassignment on the estimation of $EE$ was conducted. Therefore, 
$g$ is assumed to be 1, which implies $E(c_b) = 0$ and $\mathrm{Var}(c_b) = 0$. 


Representing Local Area Adjustments by 
Reweighting of Households 


ALAN M. ZASLAVSKY¹ 


ABSTRACT 


Suppose that undercount rates in a census have been estimated and that block-level estimates of the 
undercount have been computed. It may then be desirable to create a new roster of households incor- 
porating the estimated omissions. It is proposed here that such a roster be created by weighting the 
enumerated households. The household weights are constrained by linear equations representing the 
desired total counts of persons in each estimation class and the desired total count of households. Weights 
are then calculated that satisfy the constraints while making the fitted table as close as possible to the 
raw data. The procedure may be regarded as an extension of the standard “raking” methodology to 
situations where the constraints do not refer to the margins of a contingency table. Continuous as well 
as discrete covariates may be used in the adjustment, and it is possible to check directly whether the 
constraints can be satisfied. Methods are proposed for the use of weighted data for various Census pur- 
poses, and for adjustment of covariate information on characteristics of omitted households, such as 
income, that are not directly considered in undercount estimation. 


KEY WORDS: Undercount; Raking; Local-area adjustment; Missing data. 


1. HOUSEHOLD-LEVEL ADJUSTMENT BY WEIGHTING 


A major research effort has been devoted to methods for estimation of the undercount 
in the 1990 Census in the United States (National Academy of Sciences 1985). In one of the 
primary methodologies that has been proposed, a Post Enumeration Survey (PES) would 
be conducted shortly after the Census in a sample of blocks. The fraction of persons in the 
PES who were omitted from the Census enumeration yields an estimate of Census under- 
coverage. Estimates of the undercount would be carried down to some geographical level 
(possibly the smallest geographical unit used by the Census, the block). These estimates would 
apply to classes formed on the basis of characteristics of persons, as well as possibly some 
household or block-level characteristics. The term “class” will be used henceforth to refer 
to estimation or adjustment classes or cells; the term “block” will refer to the smallest 
geographical unit for which undercount estimates are calculated. The 1980 Census found 
approximately one hundred million households in two to four million blocks, depending on 
the definitions used. 

For each block, the outcome of the processes described above would be a vector of 
estimated undercounts, with S components corresponding to the adjustment, or estimated 
number of persons omitted from the census in that block, from each of S adjustment classes. 
The methods by which these estimates are arrived at are beyond the scope of this paper. 
However, in our examples we shall assume that for each class within each block there is an 
undercount rate, expressing estimated omissions as a fraction of enumerated persons in that 
class and block. In this paper, the term “adjustment” refers to any process which incorporates 
the estimated undercount into the enumeration. The adjustment classes might be, but would 
not necessarily be, the same as the post-strata formed in analysis of a Post-Enumeration 
Program. For forming simple marginal tabulations of persons by characteristics, this 
information might well be adequate. In particular, small-area counts used for various official and 
commercial purposes could be calculated from block totals. 

However, for some purposes it would be desirable to place the added persons in households. 
We assume for these purposes that there is also an estimate of the number of omissions of whole 
households in each block. There might also be information distinguishing omissions of persons 
within enumerated households from those in omitted households. 

If the resulting adjusted records are to be meaningful, the composition of the added 
households and the relationships of their individual members must be logically consistent and 
typical of the types of households found in that area. The term “composition” will be used 
to refer to the number of household members from each adjustment class. Thus, for example, 
a household consisting of a 20-year-old white female head of household, a 75-year-old 
Chinese male, and a 10-year-old black daughter would not be a very plausible household, even 
if all of its members were from classes that are well represented in the block. Yet to describe 
these patterns abstractly and create new households that fit them is a daunting task. 


Example 1: Forming a roster of households. 
Table 1 illustrates part of a census enumeration as it might appear on a microdata tape. 


Table 2 represents the same roster, showing how the composition of the households 
might be summarized if there were only three estimation classes: (1) men over 20 years 
of age, (2) women over 20 years of age, and (3) children up to 20 years of age. 


Table 1 
A piece of a sample microdata file 


Name             Address            Sex    Age 

John Smith       328 Main Street    M      34 
Mary Smith       328 Main Street    F      32 
Louise Smith     328 Main Street    F      [illegible] 
Nancy Chen       330 Main Street    F      62 
Jorge Ramirez    332 Main Street    M      [illegible] 
Juan Ramirez     332 Main Street    M      24 


Table 2 
Microdata file recoded by household, showing 
composition of households 

                     Count of persons by class 
Address              Class 1    Class 2    Class 3 

328 Main Street         1          1          1 
330 Main Street         0          1          0 
332 Main Street         2          0          0 


Essentially the same problem arises in many situations in which a household survey must 
be reweighted to match known marginal totals for various classes of individuals. 
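
The recoding from person records (Table 1) to household compositions (Table 2) can be sketched in Python as follows. This is an illustration under assumptions: the function names are ours, and the two ages that are illegible in Table 1 are replaced by hypothetical values (11 for Louise Smith and 27 for Jorge Ramirez) chosen only to be consistent with the class counts shown in Table 2.

```python
def adjustment_class(sex, age):
    """Class 1: men over 20; class 2: women over 20; class 3: children up to 20."""
    if age <= 20:
        return 3
    return 1 if sex == "M" else 2


def household_composition(persons):
    """Map each address to its counts of persons in classes 1, 2, and 3."""
    table = {}
    for _name, address, sex, age in persons:
        counts = table.setdefault(address, [0, 0, 0])
        counts[adjustment_class(sex, age) - 1] += 1
    return table


# Person records from Table 1; ages 11 and 27 are hypothetical stand-ins
# for the two illegible entries.
roster = [
    ("John Smith", "328 Main Street", "M", 34),
    ("Mary Smith", "328 Main Street", "F", 32),
    ("Louise Smith", "328 Main Street", "F", 11),
    ("Nancy Chen", "330 Main Street", "F", 62),
    ("Jorge Ramirez", "332 Main Street", "M", 27),
    ("Juan Ramirez", "332 Main Street", "M", 24),
]
composition = household_composition(roster)
```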

The essence of the method proposed in this paper is to assign weights to the households 
enumerated in the census lists for the block, so that the weighted totals of persons in each adjust- 
ment class and the weighted total number of households are precisely equal to the correspon- 
ding adjusted totals. Thus, although the weighting changes the proportionate composition of 
the block, all of the households are real and possess characteristics and relationships that are 
logically consistent and reasonable for that block. This weighting methodology is similar to 
the standard raking adjustment, in which the weight applied to counts in a cell of a contingency 
table is the adjusted count divided by the original count. The household weights are calculated 
after the block totals have been adjusted and will be consistent with those totals. For most 
Census purposes, the weighted records would be an adequate basis for forming published tables 
and sampled lists. 

This proposal might be contrasted with imputation methods, in which undercounted units 
are represented by whole units added to the roster. The imputed units may be either persons 
or households. Although individual persons may be imputed into the block, the problem of 
fitting these persons into plausible households remains unsolved. Placing them in fictitious 
“group quarters,” as was done in some tests of adjustment procedures, sidesteps this problem 
at the cost of creating a skewed picture of relationships in the block. Reweighting or 
imputation of individuals would be appropriate for residents of institutions or group homes, for whom 
the particular configuration of persons in the dwelling unit has no particular significance. 

Another approach to imputation starts with probability models for omissions of households 
and of persons within households, and draws imputed households from the posterior 
distribution of the omissions given the enumerated households. This methodology is suited 
to the multiple imputation approach (Rubin 1987), in which the entire imputation process is 
repeated several times to represent the variability introduced by the underenumeration. 
However, in each block roster that is created, totals based on enumerated and imputed 
households would not necessarily be precisely equal to the desired adjusted totals. In this paper, 
our concern is with methods that give an exact fit to population estimates derived at a preceding 
stage. 

The remaining sections of this paper develop methods for the proposed weighting adjust- 
ment. Section 2 gives a mathematical formulation of the objectives of the weighting scheme, 
while Section 3 explains how to fit the weights. Section 4 explains how to incorporate the distinc- 
tion between omissions in enumerated and omitted households into the scheme. Section 5 
introduces some refinements that improve the robustness of the procedure against the variability 
of small blocks. Section 6 describes simulation results. Section 7 discusses the use of weighted 
data for various Census purposes, while Section 8 considers the effects of the weighting adjust- 
ment on covariates that are not part of the scheme used in forming the adjustment classes. 
Finally, Section 9 summarizes some unresolved questions and areas for future research. 


2. OBJECTIVES AND MATHEMATICAL FORMULATION 
OF A WEIGHTING PLAN 


It is an essential goal of the proposed plan that the population of the block be assigned to 
valid household units, so that statistics for which the unit is the household are unambiguously 
defined. Thus, weights are assigned to households; the same weights apply to every person 
within the household. 

In order that the counts in the weighted roster be those which are given by the predetermined 
adjustment, the following constraints must be satisfied: 
(A1) Within each block, the sum of household weights equals the adjusted number of 
households. 
(A2) Within each adjustment class and each block, the sum of weights for persons equals 
the adjusted number of persons. 
In order that the weighted block roster be as similar as possible to the original block roster, 
we further require that: 
(B) The weights should be, in some sense, as close to each other as possible. 


With unit (or equal) weights, the composition of the block remains unchanged. If the weights 
are not very unequal, the census composition of the block is nearly preserved by the weighting 
scheme. To the extent that information about the undercount does not require a drastic 
revision of our view of the makeup of the block, such a drastic revision should be avoided, 
consistent with good survey practice regarding weights. 

We now turn to the mathematical formulation of these criteria. Suppose that in the block 
under consideration, there are $S$ adjustment classes and $I$ enumerated households, and 
household $i$ contains $C_{is}$ members from class $s$. Suppose that $H$ is the desired total number of 
households in the adjusted roster for the block and $D_s$ is the desired total number of persons 
in class $s$. Let $W_i$, $i = 1, 2, \ldots, I$, be the weights corresponding to the households. (A1) 
requires that 

$$ \sum_{i=1}^{I} W_i = H \tag{1} $$

and (A2) requires that 

$$ \sum_{i=1}^{I} W_i C_{is} = D_s, \quad s = 1, 2, \ldots, S. \tag{2} $$

These constraints can be represented by a matrix equation of the form $AW = B$, where 

$$ A = \begin{bmatrix} \mathbf{1} \\ C' \end{bmatrix}, \quad B = \begin{bmatrix} H \\ D \end{bmatrix}, \quad W' = [W_1 \; W_2 \; \ldots \; W_I], \quad \text{and} \quad D' = [D_1 \; D_2 \; \ldots \; D_S], \tag{3} $$

and $\mathbf{1}$ is a row of 1’s. 


Objective (B) is represented by selecting some objective function that represents the distance 
between the weights $W$ and uniform weighting, and minimizing it. We will use the objective 
function $T = \sum_i W_i \log(W_i)$. This measure is proportional to the discriminant information 
(Kullback-Leibler information) of the discrete probability distribution (over households) with 
relative weights $W_i$ with respect to the probability distribution with equal weights, and is the 
same objective function that underlies the traditional “raking” (iterative proportional fitting) 
procedure for adjusting contingency tables (Deming and Stephan 1940; Ireland and Kullback 
1968; Oh and Scheuren 1978 have a larger bibliography). Thus, our procedure may be regarded 
as an extension of raking. Scheuren (1973) applies raking to reweighting of households; Cilke 
and Wyscarver (1988) reweight to linear constraints but use a different objective function than 
those considered here. Methods similar to those presented here were developed independently 
by Alexander (1987). 

In the context of raking, initial counts $X$ are given for cells in a contingency table, and new 
cell counts $Y$ are calculated to minimize the objective function $\sum_i Y_i \log(Y_i / X_i)$. Then the 
weights of the original observations are the ratios $W_i = Y_i / X_i$. In our context, if $X_i$ 
households happened to have exactly the same composition we could regard them, in the same 
way, as forming a single entry in the roster with initial count $X_i$ and fit an adjusted count $Y_i$. 
However, with a large number of adjustment classes, it would be unusual for several households 
in the same block to have exactly the same composition. Thus we will not attempt to group 
households; rather, it is notationally and computationally simpler to list the households 
separately so that for each enumerated household composition the initial count $X_i = 1$ and 
$Y_i = W_i$. Aside from this notational difference, the mathematical formulation here differs 
from that of a raking adjustment only in that the linear constraints do not have the special 
structure of margins in a contingency table. For brevity in the presentation of examples, we 
will sometimes include a count on a line to represent that number of identical lines in the 
roster of households. 

In the contingency table setting, raking preserves cross-product ratios of cells, and preserves 
independence of variables when it holds in the original table. For these reasons, it has been 
called ‘‘structure-preserving estimation’’ in small-area estimation applications (Purcell 1979; 
Purcell and Kish 1979). See Section 10.1 for a further discussion of objective functions. 
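
As an illustration of the structure-preserving property, the following Python sketch rakes a small two-way table to new margins and compares the cross-product ratio before and after. The table, the target margins, and the function names are hypothetical and chosen only for the illustration.

```python
import numpy as np

def rake(table, row_targets, col_targets, tol=1e-9, max_iter=1000):
    """Iterative proportional fitting of a two-way table to new margins."""
    y = np.asarray(table, dtype=float).copy()
    for _ in range(max_iter):
        y *= (row_targets / y.sum(axis=1))[:, None]   # match the row margins
        y *= (col_targets / y.sum(axis=0))[None, :]   # match the column margins
        if np.allclose(y.sum(axis=1), row_targets, rtol=0, atol=tol):
            break
    return y

def cross_product_ratio(t):
    """Cross-product (odds) ratio of a 2 x 2 table."""
    return t[0, 0] * t[1, 1] / (t[0, 1] * t[1, 0])

x = np.array([[30.0, 10.0], [20.0, 40.0]])   # hypothetical initial counts X
row_targets = np.array([45.0, 65.0])         # hypothetical adjusted margins
col_targets = np.array([60.0, 50.0])         # (both sets sum to 110)

y = rake(x, row_targets, col_targets)        # adjusted counts Y
w = y / x                                    # cell weights W = Y / X
```

Each raking step multiplies whole rows or whole columns by a constant, which is why the cross-product ratio of `y` equals that of `x`.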

Our procedure differs from raking in that the linear constraints do not necessarily refer to 
margins in a contingency table. Our methodology includes raking as a special case, as well as 
the raking generalization of Oh and Scheuren (1978) in which different tables are used to fit 
each margin. In fact, constraints may be imposed on continuous as well as discrete covariates; 
applications of this sort are proposed in Section 8.3. Furthermore, the algorithms that are set 
forth allow direct determination of whether there are in fact any weights that are consistent 
with all of the given constraints. It is possible then to select constraints that must be relaxed 
in order to fit weights. These features give these methods potential applicability extending 
beyond the area of representing undercount. 


3. FITTING THE WEIGHTS 


The problem before us now is to determine weights satisfying the constraints AW = B, 
$W \ge 0$, minimizing the objective function $T = \sum_i W_i \log W_i$. To make T a continuous 
function of W, we adopt the usual convention 0 log 0 = 0. 

We will call any weight vector that satisfies the linear constraints (the equations and the 
inequalities) a feasible solution. As long as there is a constraint on the total weight of the 
households, the set of feasible solutions is bounded and therefore T assumes a minimum value 
on it; furthermore, since T is strictly convex, the solution is unique. 

The problem of calculating weights then naturally is divided into three tasks: (1) determining 
whether the linear constraints AW = B are consistent; (2) determining whether there are 
any feasible solutions; and (3) finding the feasible solution minimizing T. We will suppose that 
there are J households and p constraints, so A is a p × J matrix. 


Example 2: Fitting weights. 


Table 3 illustrates the roster of households in a block in which three classes are 
represented, as in Example 1; we may think of the classes as ‘‘men,’’ ‘‘women,’’ and 
‘‘children.’’ This table may be regarded as a condensed version of a table with 295 
lines, each representing one household. 


Table 3 
A household roster 

          Count per household by class 
Line      Class 1      Class 2      Class 3        Number of 
          (men)        (women)      (children)     households 
1         0            1            0              50 
2         0            1            1              40 
3         1            0            0              40 
4         1            0            2              15 
5         1            1            0              50 
6         1            1            1              60 
7         1            1            2              40 


Table 4 
Adjusted totals 

              Raw        Adjustment     Adjusted 
              count      rate           count 
Class 1       205        .05            215 
Class 2       240        .03            247 
Class 3       210        .04            218 
Households    295        .02            301 


The unadjusted and adjusted counts of households and of persons in each class are 
found in Table 4. The adjusted counts are calculated by applying the listed adjustment 
rates and rounding. The method by which the adjusted counts are obtained is immaterial, 
however, to the rest of the process. 
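
The roster and totals of Example 2 can be written as the matrix A and vector B of the constraint system AW = B. The sketch below is one way to do so in Python; the row order (men, women, children, households) and the variable names are assumptions made for the illustration.

```python
import numpy as np

# Table 3: (men, women, children) per household composition, and raw household counts.
compositions = np.array([
    [0, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 2],
    [1, 1, 0],
    [1, 1, 1],
    [1, 1, 2],
])
n_households = np.array([50, 40, 40, 15, 50, 60, 40])

# One column per enumerated household (initial count 1): repeat each line by its count.
per_household = np.repeat(compositions, n_households, axis=0)     # 295 x 3
A = np.vstack([per_household.T, np.ones(len(per_household))])     # 4 x 295

raw = A.sum(axis=1)                          # unweighted totals (all weights equal to 1)
rates = np.array([0.05, 0.03, 0.04, 0.02])   # Table 4 adjustment rates
B = np.rint(raw * (1 + rates))               # adjusted totals, rounded as in the text
```

With unit weights the row sums of `A` reproduce the raw counts of Table 4, and `B` reproduces the adjusted counts.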


3.1 Consistency of Linear Constraints 

As long as the rows of A are independent, the constraints AW = B will be consistent. If 
any row is dependent on the others, the corresponding constraint is either inconsistent or 
redundant, depending on the values in B. Dependent rows can be identified by applying the Q-R 
decomposition to A’. If the corresponding constraints are redundant, they may be deleted 
without any loss of information; if they are inconsistent, the constraints must be reformulated in 
some way. 
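
A minimal sketch of this check follows. It does not use the Q-R decomposition named above; as a simpler stand-in it tests, row by row, whether the rank increases, and then compares the implied right-hand side with the stated one. The function name and the small example are hypothetical.

```python
import numpy as np

def check_consistency(A, B, tol=1e-9):
    """Label each constraint row as independent, redundant, or inconsistent."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    kept, status = [], []
    for j in range(A.shape[0]):
        trial = kept + [j]
        if np.linalg.matrix_rank(A[trial]) == len(trial):
            kept.append(j)
            status.append("independent")
            continue
        if kept:
            # Row j is a combination of the kept rows: A[j] = c @ A[kept].
            c, *_ = np.linalg.lstsq(A[kept].T, A[j], rcond=None)
            implied = c @ B[kept]
        else:
            implied = 0.0            # an all-zero first row implies a total of zero
        ok = abs(implied - B[j]) <= tol * max(1.0, abs(B[j]))
        status.append("redundant" if ok else "inconsistent")
    return status

# Hypothetical example: the third row is the sum of the first two.
A_demo = [[1, 1, 1], [1, 0, 2], [2, 1, 3]]
redundant_case = check_consistency(A_demo, [3, 3, 6])
inconsistent_case = check_consistency(A_demo, [3, 3, 7])
```

Processing the rows in order of importance means that a dependent row is always judged against the more important rows that precede it.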


Example 2: (continued). 


The A matrix for this example has independent rows, and hence the constraints are 
consistent. 
In Section 5, we consider circumstances in which inconsistent constraints are likely to appear 
and some methods for dealing with them. 


3.2 Existence of Feasible Solutions 


Determining the existence of feasible solutions is equivalent to determining an initial feasible 
solution in a linear programming problem, and the standard algorithms can be used. Suppose 
our problem is to find a positive solution W to AW = B, where $B \ge 0$. (If the latter condition 
does not hold it can be made true by reversing the sign of negative elements of B and the 
corresponding rows in A.) Then create an augmented problem $[A \mid I]\,[W' \mid Z']' = B$, 
$W, Z \ge 0$, where I is a p × p identity matrix and Z is a p-element vector variable. This 
problem automatically has an initial solution W = 0, Z = B. Then apply the simplex method 
(as in Gass (1964) or any other linear programming text) to minimize $\sum_i Z_i$. If that sum can 
be reduced to 0, the corresponding W values are a solution to the original problem, while if 
it cannot, the original problem has no solution. 
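
The augmented problem can be sketched with a general-purpose LP solver. The code below uses `scipy.optimize.linprog` with the HiGHS backend rather than a hand-written simplex routine, so it finds nonnegative (not necessarily strictly positive) weights. The function name is hypothetical, and the columns of the first example are the seven lines of Table 3, so the unknowns are total weighted counts per line.

```python
import numpy as np
from scipy.optimize import linprog

def find_feasible_weights(A, B, tol=1e-8):
    """Phase-one LP: minimise sum(Z) subject to [A | I][W; Z] = B, W >= 0, Z >= 0."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    sign = np.where(B < 0, -1.0, 1.0)        # make the right-hand side nonnegative
    A, B = A * sign[:, None], B * sign
    p, n = A.shape
    cost = np.concatenate([np.zeros(n), np.ones(p)])
    res = linprog(cost, A_eq=np.hstack([A, np.eye(p)]), b_eq=B,
                  bounds=(0, None), method="highs")
    if res.status != 0 or res.fun > tol:
        return None                          # the artificial variables cannot all reach zero
    return res.x[:n]

# Example 2, one column per line of Table 3; rows: households, men, women, children.
A_ex2 = np.array([
    [1, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 1, 1, 1, 1],
    [1, 1, 0, 0, 1, 1, 1],
    [0, 1, 0, 2, 0, 1, 2],
])
B_ex2 = np.array([301, 215, 247, 218])
W_feasible = find_feasible_weights(A_ex2, B_ex2)

# Hypothetical infeasible case: 3 men must fit into 2 households with at most one man each.
W_none = find_feasible_weights([[1, 1], [0, 1]], [2, 3])
```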


Example 2: (continued). 


A feasible (but not optimal) solution for this example gives total weighted counts of 
86, 54, 29, and 132 to the household compositions in lines 2, 3, 5, and 6 respectively of 
Table 3. It may be verified that these counts yield the desired adjusted totals for 
households and for individuals in each class. 
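
The verification is a single matrix product. In the sketch below the rows of A are households, men, women, and children, and the columns are lines 1 to 7 of Table 3; this layout is an assumption made for the illustration.

```python
import numpy as np

A = np.array([
    [1, 1, 1, 1, 1, 1, 1],   # households
    [0, 0, 1, 1, 1, 1, 1],   # men
    [1, 1, 0, 0, 1, 1, 1],   # women
    [0, 1, 0, 2, 0, 1, 2],   # children
])
B = np.array([301, 215, 247, 218])          # adjusted totals from Table 4

counts = np.zeros(7)
counts[[1, 2, 4, 5]] = [86, 54, 29, 132]    # weighted counts for lines 2, 3, 5, 6

totals = A @ counts                          # weighted totals implied by the solution
```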


The problem of infeasibility is similar to that of inconsistency and is also discussed in 
Section 5. 


3.3 Optimizing the Objective Function 


By the method of Lagrange multipliers, the minimizing solution must satisfy the equations 
$\partial T/\partial W_i = \log W_i + 1 = a_i'\lambda$, where $a_i$ is the i-th column of A and $\lambda' = (\lambda_1, \lambda_2, \ldots, \lambda_p)$. 
Then $W_i = \exp(a_i'\lambda - 1)$; thus the model for the weights is log-linear in form, like that for 
a conventional raking adjustment. $\lambda_s$ represents the additional log-weight increment associated 
with a unit increment in the corresponding constraint coefficient $a_{si}$, i.e. adding an additional 
household member from adjustment class s to the household. 

We can solve for $\lambda$ by Newton’s method to satisfy AW = B. The iterative scheme we use is 


$\lambda^{(t+1)} = \lambda^{(t)} - (A W^* A')^{-1} (AW - B)$,    (4) 


where $W^*$ is the matrix with the elements of $W = W(\lambda^{(t)})$ on the diagonal. A good starting 
value for $\lambda$ is $\lambda^{(0)} = (AA')^{-1}B$, which can be derived from a linear approximation around 
equal starting weights. A cyclic descent procedure for solving these equations, which is a 
generalization of iterative proportional fitting, is described in Section 10.2. 
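
A minimal Python sketch of the Newton iteration follows, applied to the roster of Example 2. The update is the standard Newton step for the system AW(λ) = B with W_i = exp(a_i'λ − 1), which is assumed here to be the scheme of equation (4); the function and variable names are hypothetical, and no safeguard against infeasible constraints is included.

```python
import numpy as np

def fit_weights(A, B, tol=1e-8, max_iter=50):
    """Minimise sum(W log W) subject to A W = B by Newton's method on lambda."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    lam = np.linalg.solve(A @ A.T, B)                  # starting value (A A')^{-1} B
    for step in range(max_iter):
        W = np.exp(A.T @ lam - 1.0)                    # W_i = exp(a_i' lambda - 1)
        resid = A @ W - B
        if np.max(np.abs(resid)) < tol:
            return W, lam, step
        # (A * W) @ A.T equals A W* A', the Jacobian of A W(lambda) with respect to lambda.
        lam = lam - np.linalg.solve((A * W) @ A.T, resid)
    raise RuntimeError("Newton iteration did not converge")

# Example 2: one column per household; rows are households, men, women, children.
compositions = np.array([[0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 2],
                         [1, 1, 0], [1, 1, 1], [1, 1, 2]])
n_households = np.array([50, 40, 40, 15, 50, 60, 40])
line = np.repeat(np.arange(7), n_households)           # Table 3 line of each household
A = np.vstack([np.ones(line.size), compositions[line].T])
B = np.array([301.0, 215.0, 247.0, 218.0])

W, lam, steps = fit_weights(A, B)
line_weights = np.array([W[line == k][0] for k in range(7)])   # one weight per Table 3 line
```

Households with the same composition receive the same weight, so `line_weights` can be compared directly with the per-line weights reported for this example.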


Example 2: (continued). 


The weights per household and total weighted counts (weight times raw count) for 
each line in Table 3 are shown in Table 5. No household is upweighted by more than 8% 
or downweighted by more than 5%. 


Table 5 
Optimal weights for Example 2 

Line #      Weight      Weighted 
                        counts 
1           0.9554      47.77 
2           0.9557      38.23 
3           0.9816      39.27 
4           0.9823      14.73 
5           1.0730      53.65 
6           1.0734      64.40 
7           1.0737      42.95 


4. WHOLE- AND WITHIN-HOUSEHOLD ADJUSTMENTS 


We now consider the distinction between within-household adjustments (that is, adjustments 
for omissions of persons within enumerated households) and whole-household adjustments 
(that is, adjustments for omissions of whole households). This distinction has previously been 
made for purposes of analysing the causes of undercount (Fay 1986). Our concern here is to 
use it to more accurately represent the undercount by an adjustment. 

Within-household adjustments do not involve adding any households to the roster, but only 
shifting weight between households to increase the weighted totals of persons in the various 
classes. That is, households with few or no persons in a particular class are downweighted and 
those with many are upweighted, so that the total household weight remains constant. Thus, 
in this portion of the adjustment, some households will inevitably have their weights reduced. 
Whole-household adjustments, on the other hand, correspond to households that were omit- 
ted entirely from the census. These adjustments do not reflect on the accuracy of the enumerated 
households; thus they should be represented by adding households to the roster without taking 
weight away from the households that were enumerated. 

We propose to separate these two portions of the adjustment. One set of constraints 
represents the within-household adjustment. The total household weights are here constrained 
to equal the enumerated count of households, while the total weights assigned to persons in 
each class are constrained to equal the enumerated count in that class plus the within-household 
adjustment for that class. That is, $AW_1 = B_1$, where $B_1$ consists of the enumerated household count 
and the counts of persons by class adjusted for within-household undercount. 

A second set of constraints represents the whole-household adjustment. The total household 
weights are here constrained to equal the estimated omitted households, and the total person 
weights in each class are constrained to equal the estimated omitted persons in those households. 
That is, $AW_2 = B_2$, where $B_2$ consists of the count of added households and the counts of added 
persons by class for the adjustment for whole-household undercount. 

After fitting two sets of weights corresponding to the two sets of constraints, the two weights 
for each household are added to obtain weights that incorporate both parts of the adjustment 
($W = W_1 + W_2$). The distinction between whole- and within-household adjustments 
contains information which may lead to a different set of adjusted weights than would be 
calculated if the adjustments were combined, as is illustrated in Example 3. However, if this 
distinction is not made in the estimation of the undercount, an adjustment can still be calculated 
in a single step. 


Example 3: adjustments for whole-household omissions. 


Suppose there are only two adjustment classes, and a hypothetical block has the com- 
position described in the first three columns of Table 6. 

Suppose now that to the 30,010 households enumerated, we must add 231 persons each 
in Class 1 and Class 2, and 121 households. The last three columns of Table 6 show the 
adjusted counts under alternative assumptions: (1) the omitted persons may belong to 
any household, enumerated or omitted, and (2) all of the omitted persons were in the 
omitted households. 

When the omitted persons could have been in any household, the algorithm 
downweights the households with only one person from each class (1,1) and upweights 
households with two from one class and one from the other (1,2 and 2,1). While the 
households with two persons from each class are substantially upweighted (by a factor 
of 1.354), only a small portion of the added persons appear in those households since 
the original count for that composition is so small. 


Table 6 
Hypothetical raw and adjusted household counts for Example 3 

                                        (1) Omitted        (2) Omitted persons in 
Household                Raw count      persons in any     omitted households only 
composition              (number of     household 
Class 1     Class 2      households)    Adjusted           Count of         Count of omitted 
                                        count              omitted          and enumerated 
                                                           households       households 
1           1            10,000         9904.54            .01              10,000.01 
1           2            10,000         10106.46           10.99            10,010.99 
2           1            10,000         10106.46           10.99            10,010.99 
2           2            10             13.54              99.01            109.01 


When the omitted persons appear only in the omitted households, weights are 
calculated first to fit 231 x 2 = 462 persons into 121 additional households, and then 
these weights are added to the unit weights in the raw counts. While no household com- 
position is downweighted, the (2,2) households are upweighted extremely (by a factor 
of 10.901). In fact, it is mathematically impossible to accommodate 462 persons in 121 
households of two to four persons each without having at least 99 households with 4 
members. Thus, the information that the added persons (or some known fraction of them) 
belong in the omitted households substantially changes our view of the appropriate 
adjustment. 
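
The lower bound of 99 four-person households can be checked directly. Writing n_2, n_3, and n_4 for the numbers of omitted households with two, three, and four members (symbols introduced here only for this check):

```latex
\begin{align*}
n_2 + n_3 + n_4 &= 121, \\
2n_2 + 3n_3 + 4n_4 &= 462, \\
\text{so}\quad n_3 + 2n_4 &= 462 - 2 \cdot 121 = 220. \\
\text{Since } n_3 \le 121 - n_4:\quad 220 &\le 121 + n_4, \quad\text{that is,}\quad n_4 \ge 99.
\end{align*}
```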


5. FEASIBILITY OF CONSTRAINTS 


In the preceding sections we have assumed that feasible solutions exist to the constrained 
optimization problem. Here we will consider situations in which the solutions will not exist 
or will be unsatisfactory, and some alternative methods to deal with these situations. 


5.1 When Will Constraints be Non-feasible? 


There are three ways in which the constraints may fail to allow of satisfactory solutions: 
(1) when the constraints are actually inconsistent, (2) when the constraints are consistent but 
there are no positive weights that satisfy them, and (3) when there is a feasible solution but 
it involves an extreme adjustment to some household weights. The issues associated with these 
three failure modes are fairly similar. 

One could write down constraints that are intrinsically inconsistent, for example that all 
classes of men are adjusted upward by 2% while men in total are adjusted upward by 4%. In 
our procedure each constraint applies to the number of persons in a distinct adjustment class 
and so there are no inconsistencies of this sort. However, a contingent inconsistency is still 
possible, that is to say one that depends on the particular collection of household composi- 
tions that appears in a block. The following are examples of contingent inconsistency, 
infeasibility, or unsatisfactory weights: 


(1) Proposed undercount estimation methods envision defining over 100 adjustment classes. 
In a small but diverse block the number of classes represented might be larger than the 
number of households; hence the number of constraints would be larger than the number 
of weights to be fitted. An inconsistency is then almost inevitable. 

(2) If all households in the block have exactly the same number of members from a particular 
adjustment class (e.g. every household has one young Hispanic girl), then the number of 
members of this class represented is unaffected by the distribution of weights. 


(3) The adjustment of the number of households may be too large or small to accommodate 
the adjustment of persons in some class. This may represent a failure of the model for 
adjustment of the number of households. For example, suppose that the number of men 
to be added by the whole-household adjustment is greater than the number of households 
to be added, but no household in the block has more than one man. The constraints then 
might be consistent but infeasible, since they could be satisfied only by assigning negative 
weights to some households without men. 


(4) The block may have had omission rates atypical of blocks in the PES on which omission 
rates were estimated. For example, suppose that in most blocks (including most of the PES 
sample blocks), adult males with certain characteristics tend to be heavily undercounted, 
but the block being adjusted is atypical in having adult males of this class present in most 
households and well counted. The class undercount estimate might lead to an extreme 
upward adjustment that could not be accommodated within the existing households. 


(5) Some adjustment may require giving substantial additional weight to households containing 
persons from a combination of adjustment classes that appears in only one household, 
so that household receives an extreme weight. In this case the problem is feasible but the 
solution is not very satisfactory. 


Problems of infeasibility may also arise where the difficulty cannot be so easily traced to 
a particular inconsistency in the adjustment. 
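
Case (2) above can be illustrated with a rank check. In the hypothetical three-household block below, every household has exactly one member of class s, so the class-s row of A duplicates the household row; the constraints are then consistent only if both totals are adjusted by the same rate. All numbers and names are assumptions made for the illustration.

```python
import numpy as np

# Rows: households, class s, class t. Columns: three households.
A = np.array([
    [1, 1, 1],     # every household counts once
    [1, 1, 1],     # each household has exactly one member of class s
    [0, 1, 2],     # class t varies across households
])

rank = np.linalg.matrix_rank(A)    # only two independent constraints

def is_consistent(A, B):
    """A W = B is solvable iff appending B as a column does not raise the rank."""
    return np.linalg.matrix_rank(np.column_stack([A, B])) == np.linalg.matrix_rank(A)

same_rate = is_consistent(A, np.array([3.06, 3.06, 3.1]))   # households and class s both up 2%
diff_rate = is_consistent(A, np.array([3.06, 3.15, 3.1]))   # class s up 5%, households up 2%
```

The same rank argument covers case (1): a block with fewer households than constraints gives A a rank no larger than the number of households, so some rows must be dependent.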


5.2 Making the Constraints Feasible 


Regardless of the stage of the fitting procedure at which the infeasibility is discovered, several 
methods are available to relax the constraints and make them feasible. In this section, we survey 
several such methods, drawing out both the intuitive logic of each choice and the computa- 
tional methods required. 


5.2.1 Methods Based on Dropping Rows (constraints) of A 


When checking for consistency of constraints, some rows may be found to be linearly depen- 
dent on the previous rows and hence either redundant or inconsistent. If these rows are simply 
dropped from the A matrix, a consistent set of constraints is obtained; thus, no further com- 
putational effort is required. 

If the constraints are arranged in sequence from the most important to the least important, 
then the less important constraints will be dropped when they are inconsistent with the more 
important ones. This ordering makes the most sense if the original constraints on distinct adjust- 
ment classes (defined by a multi-way classification of the population) are reframed in an 
ANOVA-like manner as constraints on total population (‘‘grand mean’’), classes defined by 
one classification variable (‘‘main effects’’), and classes defined by interactions. For exam- 
ple, if there are ten adjustment classes defined by two sexes and five age ranges, the reframed 
constraints in order of importance might be: total population (1 constraint), population by 
sex (1 more constraint), population by age (4 more constraints), age-sex interactions (the remain- 
ing 4 constraints). The 4 age constraints could be further broken down as old-vs.-young (1 con- 
straint) and 3 further constraints within those larger groups. 

A similar procedure can be applied at the stage of checking feasibility of the constraints. 
If it is not possible to make all of the $Z_i = 0$, the objective function in the linear programming 
problem can be modified to be $\sum_i c_i Z_i$, with the coefficients $c_i > 0$ corresponding to the most 
important constraints made larger. Then a maximal set of feasible constraints can be identified, 
and the remaining constraints dropped. 
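
The weighted objective can be sketched with the same augmented linear program. The code below uses `scipy.optimize.linprog`; the function name, the priorities, and the toy constraints are hypothetical, and the right-hand side is assumed to be nonnegative. It reports the constraints whose artificial variable stays positive, which is one simple way to pick the constraints to drop.

```python
import numpy as np
from scipy.optimize import linprog

def relax_by_priority(A, B, priority, tol=1e-8):
    """Minimise sum(c_i Z_i) for [A | I][W; Z] = B, W >= 0, Z >= 0."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)             # assumed nonnegative
    p, n = A.shape
    cost = np.concatenate([np.zeros(n), np.asarray(priority, dtype=float)])
    res = linprog(cost, A_eq=np.hstack([A, np.eye(p)]), b_eq=B,
                  bounds=(0, None), method="highs")
    W, Z = res.x[:n], res.x[n:]
    return W, np.flatnonzero(Z > tol)           # indices of the unmet constraints

# Hypothetical conflict: W1 = 1, W2 = 1, and W1 + W2 = 1 cannot all hold.
A_demo = [[1, 0], [0, 1], [1, 1]]
B_demo = [1, 1, 1]

_, dropped_a = relax_by_priority(A_demo, B_demo, priority=[10, 1, 10])
_, dropped_b = relax_by_priority(A_demo, B_demo, priority=[1, 10, 10])
```

Raising the coefficient of a constraint makes the solver sacrifice a less important one instead: the first call gives up the second constraint and the second call gives up the first.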

The outcome of this procedure would be weights that give the correct block totals on the 
coarser classifications of persons, while failing to be correct on all cross-tabulations. 
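
The ANOVA-like reframing for the ten sex-by-age classes can be sketched as an invertible 10 × 10 transformation of the class constraints. The particular contrasts below (first sex versus second, each age level versus the last) are one possible choice and are not taken from the text; classes are assumed to be ordered with sex varying slowest.

```python
import numpy as np

sex_basis = np.array([[1, 1], [1, -1]])                       # total, sex contrast
age_basis = np.vstack([np.ones(5),                             # total
                       np.hstack([np.eye(4), -np.ones((4, 1))])])   # 4 age contrasts

C = np.kron(sex_basis, age_basis)         # class index = 5 * sex + age
order = [0, 5, 1, 2, 3, 4, 6, 7, 8, 9]    # total, sex, age (4), sex x age (4)
C = C[order]
labels = ["total", "sex"] + ["age"] * 4 + ["sex x age"] * 4

def reframe(A_classes, B_classes):
    """Re-express the ten class constraints in decreasing order of importance."""
    return C @ np.asarray(A_classes, dtype=float), C @ np.asarray(B_classes, dtype=float)
```

Because `C` is invertible, the reframed system has the same solutions as the original one; dropping trailing rows removes interaction constraints first and leaves the marginal totals intact.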


5.2.2 Methods Based on Adding Columns (households) to A 


When constraints are only contingently infeasible (in the previous sense that infeasibility 
depends on the particular set of household compositions in the block), they become feasible 
when households are added that have the required composition. The simplest application of 
this principle is to work at a higher level of geographical aggregation than a block. A few adja- 
cent blocks may be combined when problems arise in fitting, or the entire roster may be grouped 
at, for example, the enumeration district level before weighting. The larger the unit, the broader 
the range of household compositions that will be represented and the less likely that problems 
of infeasibility will arise. 

A more sophisticated procedure would use a hot-deck of households from adjacent ‘‘donor’’ 
blocks to enrich the pool of households to which weight can be assigned. Computational 
simplicity is important here since it may be necessary to scan through a long list of households 
to find the one or ones which will make the constraints feasible. In the consistency-checking 
stage, if row j of A is dependent on the previous rows, then if the column for the added 
household is independent of the columns of A (with regard only to the first j rows), row j of 
the augmented A will be independent. In the stage of checking for feasibility, if the algorithm 
halts because no reduction can be made in the objective function $\sum_i Z_i$, the search for basic 
columns can be extended to columns corresponding to households in the hot deck. Finally, 
if some household’s fitted weight is extremely high, the hot deck can be scanned for other 
households that would also receive high weights with the current values of $\lambda$ (that is, columns 
a such that $a'\lambda$ is large). If these are added to the block they will draw off some of the weight 
from the overweighted households when the weights are refitted, since they are likely to also 
have members in the same adjustment classes. 

The intuition behind this method is that the household compositions that are enumerated 
in a block are only a sample of those which actually could have appeared there had the enumera- 
tion been complete. The observed distribution of household compositions is smoothed by 
mixing it with the distribution for adjacent blocks, which contain households that are also 
typical for that area. Thus, conceptually this method is related to Bayesian smoothing methods 
that improve estimation of some quantity for one unit by borrowing strength from its distri- 
bution in similar units. This Bayesian rationale is developed in terms of a block-level random- 
effects model by Zaslavsky (1989). 

The donor blocks could be chosen by a sequential hot deck procedure; then, the donor blocks 
would tend to be geographically close to the adjustment block and no particular set of blocks 
would have undue influence on the entire census. By detailed stratification of blocks, the donor 
blocks could be selected to be similar to the block being adjusted on characteristics such as 
mean income, types of housing units, and racial balance. 


5.2.3 Combined Methods 


The two types of methods outlined above can be combined by an appropriate reframing 
of constraints. The principle here is to satisfy all constraints in the larger geographical units, 
while satisfying only the more important constraints in the smaller units. This type of 
compromise may make it possible to get a fairly good fit to the desired distribution without having 
to add additional records to the roster. 

Suppose that the A matrices for several blocks have been reframed similarly as sequences 
of rows representing main and interaction constraints. Then a single large A matrix represen- 
ting all of the constraints can be formed. The rows for the more important constraints can 
be kept separate, while rows for subsidiary constraints can be combined across blocks. For 
example, suppose there are ten adjustment classes, defined by sex (2 levels) and age (5 levels), 
and two blocks. Altogether there are eleven constraints (one for number of households and 
one for each adjustment class) in each block. If these are combined into a single matrix, keeping 
main effects and two-way interactions, the constraints are: block household counts (2 constraints), 
block populations (2 constraints), sex (1 constraint), age (4 constraints), block × sex 
interaction (1 constraint), block × age interaction (4 constraints), and sex × age interaction 
(4 constraints) in the combined blocks. Here 4 constraints have been eliminated 
(block × sex × age interaction); in a more realistic problem with more blocks, classification 
variables, and levels, the reduction would be much greater. 
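
The count of constraints in this example can be tallied as follows; the last line assumes that the eliminated rows are the three-way interaction contrasts, whose number is the product of the degrees of freedom of the three factors.

```latex
\begin{align*}
\text{separate blocks:}\quad & 2 \times (1 + 10) = 22, \\
\text{pooled:}\quad & 2 + 2 + 1 + 4 + 1 + 4 + 4 = 18, \\
\text{eliminated:}\quad & 22 - 18 = 4 = (2-1)(2-1)(5-1).
\end{align*}
```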


6. SIMULATION RESULTS 


Simulations were performed to answer two classes of questions: 


(1) The first set of questions is concerned with evaluation of the success of the algorithm in 
terms of its own constraints and objectives. Does the reweighting algorithm give an answer? 
In real problems, is there a solution to the weighting constraints? How much do the weights 
vary? Is the amount of computation required within reasonable limits? 


To answer these questions, ‘‘feasibility simulations’’ were performed in which the weighting 
algorithm was applied to simulated blocks made up of real households, using real adjustment 
rates. This procedure thus closely parallels the practical application of the algorithm. 


(2) The second set of questions is concerned with evaluation of the success of the algorithm 
in improving the quality of inferences based on a micro-data set: does the weighted micro- 
data set more accurately describe the real world than the raw, unweighted data? 


To answer these questions, simulated blocks made up of real households were drawn, 
representing the true (but unobserved) compositions of households in blocks. For each ‘‘true’’ 
block, omissions were imposed using real estimated undercount rates and a plausible model 
for the distribution of undercount among households. The weighting algorithm was applied 
to the ‘‘enumerated’’ blocks generated in this way. Summary statistics describing household 
composition were calculated for the simulated ‘‘true’’ blocks and for the simulated observed 
blocks with undercount, both unweighted and weighted for undercount adjustment. The goal 
of these ‘‘inference simulations’’ was to determine whether the reweighting brought the statistics 
closer to their values in the ‘‘true’’ blocks; in other words, did reweighting correct the biases 
caused by the undercount? 

The source of households for all simulations was the 1% ‘‘B’’ Public Use Microdata Sam- 
ple (PUMS) from the 1980 Census (Bureau of the Census 1985). Households were extracted 
from sections of Los Angeles County, California that include the site of the Test of Adjust- 
ment Related Operations (TARO) of the 1986 Test Census. 

Undercount rates were those calculated from the 1986 TARO (Diffendal 1988, Table 7) for 
adjustment classes defined by sex, age (five levels), race (Hispanic, Asian, or ‘‘other race’’), 
and tenure (owner or renter). Adjustment factors calculated from the given undercount rates 
ranged from 0.982 to 1.211. 

Each household was coded as a vector of counts representing the number of individuals in 
that household from each of the 60 adjustment classes. 
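
A sketch of this coding is given below. The level labels, the ordering of the classes, and the function names are hypothetical; in particular the age cut points are not specified here, so ages are represented only by level numbers 0 to 4, and tenure is treated as a household-level attribute shared by all members.

```python
import numpy as np

SEXES = ("male", "female")
AGES = (0, 1, 2, 3, 4)                     # five age levels (cut points not specified)
RACES = ("Hispanic", "Asian", "other")
TENURES = ("owner", "renter")
SHAPE = (len(SEXES), len(AGES), len(RACES), len(TENURES))   # 2 * 5 * 3 * 2 = 60 classes

def class_index(sex, age, race, tenure):
    """Position of an adjustment class in the 60-element vector."""
    return np.ravel_multi_index(
        (SEXES.index(sex), AGES.index(age), RACES.index(race), TENURES.index(tenure)),
        SHAPE)

def household_vector(members, tenure):
    """Count the members of one household in each of the 60 adjustment classes."""
    counts = np.zeros(int(np.prod(SHAPE)), dtype=int)
    for sex, age, race in members:
        counts[class_index(sex, age, race, tenure)] += 1
    return counts

# Hypothetical renter household with two adults and one child.
v = household_vector([("male", 3, "Hispanic"),
                      ("female", 3, "Hispanic"),
                      ("female", 0, "Hispanic")], tenure="renter")
```

Each such vector, preceded or followed by a 1 for the household count, forms one column of the A matrix.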

Further details on the simulation procedures and on a larger set of simulations are in 
Zaslavsky (1989). 


6.1 Feasibility Simulations 


For each of four block sizes (20, 50, 100, and 200 households), 50 simulated blocks were 
drawn from the full sample and 50 were drawn from only those households with no Asian 
members. For each block, simulations were attempted using two levels of the household adjust- 
ment rate (the factor by which the number of households in the block is adjusted). 

The algorithms of Section 3 were applied. To recapitulate, the linear constraints were checked 
first for consistency, and then for feasibility (existence of a positive solution); finally, weights 
were calculated using Newton’s method. As no data were available distinguishing within- 
household and whole-household omissions, no effort was made to separate them in these or 
other simulations. 

The results of these simulations are summarized in Table 7. 


Consistency and feasibility: 


The columns headed ‘‘incons’’, ‘‘infeas’’, and ‘‘OK’’ represent the number of simulated 
blocks (out of the 50 trials) in each simulation that fell into each of the following categories 
respectively: (1) the constraints were inconsistent (could not be satisfied by any weights), (2) 
the constraints were consistent but not feasible (could not be satisfied by any positive weights), 
or (3) the constraints were both consistent and feasible. 

In the ‘‘non-Asian’’ simulations there are 41 constraints to be satisfied (some of which may 
be trivial, i.e. when the corresponding adjustment classes are unrepresented in the block). Thus 
with 20-household blocks, the constraints were never consistent; with 50-household blocks, 
the constraints were sometimes consistent and then usually feasible. The constraints were usually 
feasible in 100-household blocks, and always in 200-household blocks. 

The numbered columns at the right represent the order of the simplest marginal constraint 
that could not be satisfied, in the sense of the hierarchical reparametrization in Section 5.2.1. 
Thus, column (1) indicates the number of simulated blocks for which a ‘‘main effect’’ constraint 
(marginal total of persons classified by one stratifying variable) could not be satisfied, 
column (2) indicates the number of trials for which a two-way interaction constraint could not 
be satisfied, etc. Even when the constraints were inconsistent with 50- or 100-household blocks, 
the main-effect constraints and often the two-way or even three-way interactions were feasi- 
ble. This suggests that pooling of blocks for higher-order interactions, as described in Section 
5.2.3, might be a successful strategy for dealing with problems of infeasibility. 

The results were less encouraging for simulations using the full samples. Even with 200- 
household blocks, only rarely were the constraints consistent and feasible. With increasing block 
size the lower-order constraints were more likely to be feasible. This is explained by the small 
number of households with Asian members (approximately 5% in each sample). Out of 200 
households, the expected number of Asian households would be about 10, an insufficient num- 
ber to satisfy the 20 possible constraints for the Asian adjustment classes. Such a situation in 
which some groups of adjustment classes are poorly represented in a certain region or in particular 
blocks would surely not be unusual in practice. This would require pooling of blocks 
on a large scale for the corresponding constraints, while the constraints for the better-represented 
classes might be satisfied on a smaller scale. 


Table 7 
Feasibility simulation results 

Non-Asian Households 

size   HH rate   incons   infeas   OK    maxW    minW    varW    iters   (1)   (2)   (3)   (4) 
10     1.00      50       0        0     NA      NA      NA      NA      ?     28    0     ? 
10     1.05      50       0        0     NA      NA      NA      NA      ?     ?     ?     ? 
20     1.00      50       0        0     NA      NA      NA      NA      ?     ?     0     0 
20     1.05      50       0        0     NA      NA      NA      NA      ?     ?     ?     ? 
50     1.00      47       1        2     1.921   0.200   0.142   3.00    0     5     ?     ? 
50     1.05      47       0        3     ?       ?       ?       ?       0     ?     36    ? 
100    1.00      10       0        40    2.068   0.429   0.088   2.03    0     0     ?     2 
100    1.05      10       0        40    ?       ?       ?       1.90    0     0     ?     ? 
200    1.00      0        0        50    2.434   0.543   0.063   2.18    0     0     0     0 
200    1.05      0        0        50    1.749   0.821   0.015   2.00    0     0     0     0 

Full Sample 

size   HH rate   incons   infeas   OK    maxW    minW    varW    iters   (1)   (2)   (3)   (4) 
100    1.00      49       0        1     --      --      --      --      ?     ?     ?     0 
200    1.00      49       0        1     --      --      --      --      0     2     ?     4 


Weights: 


The maximum and minimum household weights and the variance of the weights were 
calculated for each simulated block for which the constraints were consistent and feasible. For 
each simulation condition, the average value of these quantities (across simulated blocks) is 
displayed under the heads ‘‘maxW’’, ‘‘minW’’, and ‘‘varW.’’ The following observations 
characterize some of the effects of the simulation design factors on the fitted weights. 


(1) For simulations with household count adjustment factor of 1.05, in every case, the average 
variance of the weights was smaller, and the average of the minimum weights and of the 
maximum weights were closer to unity, than with household adjustment factor 1. This is 
intuitively reasonable since almost all class adjustment factors exceed 1, and it requires 
a more extreme adjustment to add individuals to existing households than to add individuals 
and households to accommodate them. For example, if the adjustment factors for 
households and for every adjustment class are all equal, every household would be 
upweighted equally. 


(2) Fixing other factors, the variance of the weights becomes smaller as the number of 
households per block increases. Again, this is intuitively reasonable because the pool of 
households is richer in a larger block; the probability of finding exactly the households 
needed to represent undercounted individuals is higher. The trends for the extreme weights 
are less clear-cut than for variances; here, the narrowing of the variance is offset by the 
larger sample over which the extreme is calculated in the larger blocks. 


(3) The average variances for simulations with 200-household blocks were at most .063. Thus 
the reweighting is generally not extreme. 


Computational costs: 


The mean number of Newton steps required to fit the weights (from the starting values given 
in Section 3.3), shown under the heading ‘‘iters’’, is usually about two. These iterations were 
sufficient to satisfy all constraints with error no greater than .001. Using this information, a 
rough estimate can be given of the number of floating point operations required to apply the 
algorithm. Computational costs of the modified raking algorithm are discussed in Section 10.2. 

Assume that blocks are of sufficient size that it is not necessary to check consistency and 
feasibility of the constraints in every case (but perhaps only when the weight fitting does not 
succeed in a few steps). Then the key calculation is fitting the weights. For production runs, 
data structures and programs should be devised which take advantage of the sparseness of the 
A matrix (due to the fact that only a few classes are represented in each household). Then if
$S_1$ is the total number of nonzero entries in A and $S_2$ is the sum (through the block) of the
squares of the number of nonzero entries for each household, each Newton step requires about
$5S_1/2 + S_2/2$ multiplications (plus a term independent of the number of households per
block). In the samples studied here, $S_2 \approx 5S_1$; $S_1$ is bounded by the total population of the
block. Thus the bound on the number of multiplications is approximately 15 × population
total (counting the start as an iteration); the number of additions is comparable.
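
As an illustration only, the operation count above can be sketched in Python. The function name and the small matrix are hypothetical and the matrix is stored densely for brevity; it is assumed, following Section 10.2, that rows of A index adjustment classes and columns index households, so that S1 is the number of nonzero entries of A and S2 is the sum over households of the squared number of nonzero entries.

```python
import numpy as np


def newton_step_cost(A):
    """Return (S1, S2, approximate multiplications per Newton step).

    Assumes columns of A are households and rows are adjustment classes.
    The household-independent term mentioned in the text is ignored.
    """
    nonzero_per_household = np.count_nonzero(A, axis=0)
    s1 = int(nonzero_per_household.sum())
    s2 = int((nonzero_per_household ** 2).sum())
    return s1, s2, 5 * s1 / 2 + s2 / 2


# Hypothetical block: 3 adjustment classes, 4 households.
A = np.array([[1, 0, 2, 0],
              [1, 1, 0, 0],
              [0, 2, 1, 1]])
s1, s2, mults = newton_step_cost(A)
print(s1, s2, mults)
```

A production version would presumably hold A in a sparse format, as the text recommends, rather than scanning a dense array.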

In an era in which even microcomputers have megaflop arithmetic capability, $8 \times 10^9$
floating point operations to reweight an entire census does not seem unreasonable. The calculation
of weights might well take less computer resources than the ‘‘bookkeeping’’ data processing
required in any method of incorporating undercount. Of course, if the procedure were applied 
to a sampled database, as in forming a public-use sample, the costs would be reduced corres- 
pondingly. 


6.2 Inference Simulations 


For the inference simulations, pseudo-blocks of 50 households each with only Hispanic 
members were drawn. These were treated as if they represented true blocks. Then simulated 
omissions were imposed on these households, assuming that each member was
(independently) omitted with probability equal to the undercount rate from Diffendal (1988), 
with two negative undercount rates truncated to 0. 

The entire distribution of the ‘‘enumerated’’ block was represented by including in the 
pseudo-Census roster the true composition and the possible compositions obtained by omis- 
sion of one or more household members, each weighted by its probability under the model. 
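
The construction of the weighted pseudo-Census roster can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the simulations: the class labels and omission rates are hypothetical, a household is represented simply as a list of the adjustment classes of its members, and compositions with zero probability are dropped.

```python
from itertools import product


def enumerated_compositions(members, omit_prob):
    """Map each possible enumerated composition to its probability.

    members   : list of adjustment-class labels, one per household member
    omit_prob : dict giving the omission probability for each class
    Members are assumed to be omitted independently of one another.
    """
    dist = {}
    for kept in product([True, False], repeat=len(members)):
        prob = 1.0
        for keep, cls in zip(kept, members):
            prob *= (1.0 - omit_prob[cls]) if keep else omit_prob[cls]
        if prob == 0.0:
            continue
        composition = tuple(sorted(c for keep, c in zip(kept, members) if keep))
        dist[composition] = dist.get(composition, 0.0) + prob
    return dist


# Hypothetical household and rates (the child rate mimics a rate truncated to 0).
household = ["adult_m", "adult_f", "child"]
rates = {"adult_m": 0.10, "adult_f": 0.05, "child": 0.0}
dist = enumerated_compositions(household, rates)
```

Each key of `dist` is one possible enumerated composition, including the true one; the empty composition corresponds to omission of the whole household.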

The pseudo-Census roster with undercount was then reweighted to the original pseudo-block 
totals for number of households and of individuals in each adjustment class. Both the pseudo- 
Census roster and the reweighted roster were compared to the original pseudo-block. 

The purpose of organizing the simulation in this manner was to remove variability due to 
randomness in the rate of omissions in a block (around the mean undercount rate) and in the 
distribution of the omissions among the households in the block. Furthermore, feasibility is 
guaranteed because the original households are always included (with weights) in the pseudo- 
Census roster. One way of regarding this setup is that each simulated block represents a very 
large population in which observed undercount rates and the distribution of observed com- 
positions approach their expectations. 
Several sets of statistics were used in evaluation of the reweighting procedure. These were
all chosen because they summarized household characteristics that are not functions of the 
populations by adjustment class. The first set was the distribution of sizes (number of members) 
of households. Note that the mean number of persons per household, like any function of the 
class totals and household count, will automatically be adjusted to the correct (pre-undercount) 
values; the distribution of sizes, however, is not controlled by the adjustment procedure. 

The second set of statistics was the distribution of number of adult (over 14 years old) 
members in households with one or more children (up to 14 years old). In this case, the mean 
is not automatically adjusted to the correct value, since it depends on the joint distribution 
of counts from different classes within households as well as on marginal totals. 

The last two sets of statistics were the distribution of the age group (coded from 1 to 5 as
in the formation of the adjustment classes) of the oldest male in the household (coded 0 if no
male is present), and the same distribution for households with one or more children. Again,
neither the distribution nor its mean is directly constrained to its true value.

The results of these simulations are summarized in Table 8. Because almost all of the dif- 
ferences noted here are highly significant (relative to between-pseudo-block variances of the 
differences), standard errors are not shown in the tables. The lines of each table are labelled
‘‘true’’ (for the original pseudo-blocks), ‘‘enum’’ (for the simulated enumerated blocks, i.e.
after omissions due to undercount), and ‘‘adjust’’ (enumerated blocks after adjustment for 
undercount). Every column except the means should be read as a percentage of households 
in the block. 


Table 8

Inference simulation results


Size distribution

               size 1       size 2       size 3       size 4      size 5+         mean
true            7.240       16.200       20.240       22.600       33.720        3.971
enum           10.349       19.631  [illegible]       20.690  [illegible]        3.632
adjust    [illegible]       16.421       20.596       21.392       34.219        3.971


Size distribution (number of adults) for households with children

               size 0       size 1       size 2       size 3       size 4      size 5+         mean
true            0.000        6.925       58.404       17.214        9.125        8.332        2.585
enum            1.736       18.309       49.874       15.965        7.677        6.439        2.323
adjust          0.924       13.227       48.557       18.223        9.810        9.209        2.562


Age of oldest male (by five age classifications)

                 none        age 1        age 2        age 3        age 4        age 5         mean
true            7.080        4.000       28.680       33.800       21.960        4.480        2.730
enum            9.981        7.388       26.296       30.972       21.160        4.203        2.585
adjust          7.853        5.989       26.307       33.439  [illegible]        4.480        2.690


Age of oldest male (by five age classifications) for households with children

                 none        age 1        age 2        age 3        age 4        age 5         mean
true            3.602        6.214       30.744       42.649       15.843        0.949        2.638
enum            5.809  [illegible]  [illegible]       39.096  [illegible]        0.894        2.488
adjust          4.272        9.069       27.242       42.038       16.418        0.962        2.601

Household size distribution was biased downwards in the enumerated blocks. As well as
correcting the mean, adjustment brought the estimated percentage for every size substantially 
closer to the true percentage. 

The distribution of number of adults in households with children was also biased downwards. 
The majority of these households had contained two adults, so this size category was most 
understated by the enumerated statistics. Due to the log-linear structure of the adjustment, 
however, the most extreme adjustments were made to the largest and smallest households. Thus, 
the highest size categories were slightly overadjusted and intermediate categories were underadjusted;
the ‘‘size 2’’ category was adjusted a small amount in the wrong direction. Nonetheless,
the mean of the adjusted distribution was much closer to the ‘‘true’’ value than the unadjusted
(enumerated) mean was.

The story is similar for the distributions of age of oldest male. Although these statistics are 
only indirectly related to the counts by class, in almost every case the adjusted distributions 
and means are closer to the ‘‘truth’’ than are the unadjusted distributions and means. 

In summary, these simulations suggest that these weighting adjustments can improve 
estimates of measures of household structure as well as the aggregate counts for which they 
were intended. However, reweighting does not provide accurate adjustments with certain con- 
figurations of the data, such as the many households with two adults noted above; to deal with 
these situations may require a model-based imputation method such as that outlined by 
Zaslavsky (1989). 


7. THE USE OF WEIGHTED DATA 


The product of the methods of the preceding sections would be a census roster in which 
households have weights, persons in households have weights adopted from their households, 
and institutionalized persons have individually assigned weights. This section outlines the use 
of these rosters for various Census purposes. 


7.1 Formation of Tables of Counts 


As with any data set of weighted observations, the sum of weights replaces the simple count 
of observations in forming tables. The only problem created by the use of weights is that of 
obtaining integer entries in the tables. This problem arises even before the calculation of 
household weights: when the estimated omissions are calculated, the counts in each class will 
not in general be integers. 

If the adjusted totals by class are rounded to be integers, any table that aggregates classes 
(for example, a count of adult males that is a sum of counts of adult males from different classes) 
will also contain integers, since it must be consistent with those totals. For tables that are not 
based on those totals, summing the weights in a particular group may not necessarily generate 
integer counts. For example, if a class combines women of ages 20-40, a sum of weights for 
women aged 20-30 would not necessarily be an integer. In any case, it seems unlikely that all 
class weights would be rounded since this might well lose the entire adjustment to roundoff 
error. However, it should be possible to use existing Census Bureau integerizing methods (‘‘con- 
trolled rounding’’) to deal with these problems, especially where non-disclosure requires that 
published counts be rounded anyway (Cox et al. 1986; Cox 1987). 


7.2 Formation of Tables of Sums and Means


Generally, sums (of continuous quantities) and means are not expected to be integers, so 
the issue of rounding does not arise. Also, tables based on long-form information are already
derived from a sample so an additional source of weights should not change the process much. 
A deeper issue is that of the values of non-classification covariates to be assigned to households 
that are ‘‘weighted in’’ to the census; this is discussed in Section 8. 


7.3 Public Use Samples


The public use tapes are a sample of census records that are released for further analysis 
by consumers of census data. 

To generate these samples from weighted census rosters requires only that the sampling pro- 
cedure be modified slightly to make sampling probabilities proportional to weights. Even on 
the 5% tape (the highest sampling rate), the weighted sampling probabilities should be smaller 
than 1. Once these tapes are produced, the user would not have to be aware of the adjustment 
and weighting process that had gone into generating them. 
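
A minimal sketch of such a modified sampling step is given below. It assumes independent (Poisson) selection of each record with probability equal to the nominal sampling rate times the record's weight; the text does not specify the sampling scheme, so this choice, the function names and the example weights are all illustrative assumptions.

```python
import random


def inclusion_probabilities(weights, rate):
    """Sampling probability for each record: nominal rate times its weight."""
    probs = [rate * w for w in weights]
    if any(p >= 1.0 for p in probs):
        raise ValueError("a weighted sampling probability reaches 1")
    return probs


def draw_public_use_sample(weights, rate, seed=0):
    """Select record indices independently with probability rate * weight."""
    rng = random.Random(seed)
    probs = inclusion_probabilities(weights, rate)
    return [i for i, p in enumerate(probs) if rng.random() < p]


# Hypothetical household weights and the 5% sampling rate mentioned above.
weights = [1.00, 1.08, 0.97, 1.15]
probs = inclusion_probabilities(weights, rate=0.05)
sample = draw_public_use_sample(weights, rate=0.05)
```

Because selection probabilities already carry the weights, records in the resulting sample can be analysed with equal weight, which is the property the text relies on.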

The public use tapes are the source of data for many of the more complicated analyses by 
sociologists, economists, planners, etc. in which the details of household composition, as well 
as counts of persons, are of importance. It is important that these tapes could be generated 
easily and used like raw census data. 

As a service to those users of the public use tapes who wish to check the sensitivity of their 
analyses to the undercount adjustment, the tape should include factors (the inverse of the adjust- 
ment weights attached to the household records in the original census rosters) that would allow 
the user to reconstruct the equivalent of the unadjusted census. 


8. ADJUSTMENT OF COVARIATES THAT ARE NOT 
USED IN CLASSIFICATION 


The methods described above guarantee that weighted block totals by variables used in 
classification, such as sex, race, and age group, will equal the adjusted block totals. However, 
these lists will also be used to accumulate totals or counts for variables such as income and 
education that might not be used in the classification scheme. This section will consider the 
effect of these adjustment methods on such statistics. For concreteness of exposition, income 
will be used as the main example. Income is an important non-classification variable; some 
research suggests that revenue allocation programs may be most affected by errors in measurement
of income (National Academy of Sciences 1985).

In general, there are two possible sources of bias in the estimation of a non-classification 
covariate: (1) bias in adjustment of household composition, and (2) systematic differences 
between fully enumerated households and households with similar composition that are omitted 
(entirely or in part). However, if we have an estimate of mean income for the block, we can 
make the weighted mean for households in the block equal the estimated (adjusted) mean in 
much the same manner we make the weighted counts of individuals in the block equal the 
estimated (adjusted) counts. 


8.1 Household Composition Bias 


In this section we will assume that the average income level associated with a certain 
household composition is the same for fully enumerated households and those which are partly 
or wholly omitted from the enumeration. In other words, we consider here the case in which 
omission is noninformative for income. 
Suppose that household income is a sum of independent contributions from persons of each
class in the household (i.e. suppose that the contributions to income from persons in each class
are independent of what other members are in the household). Then weighted household income
totals would be an unbiased estimate of the true income totals (when adjustment rates are cor- 
rect), since the sum of incomes would be a linear function of class counts for the block. How- 
ever, under the more realistic assumption that linearity does not hold, misallocation of persons 
between households (and corresponding misrepresentation of household composition in the 
adjustment) could lead to bias in income estimates. Thus, for example, the average income 
of households with two children might not be the mean of the average income of one-child 
and three-child households (with the same composition of adult members). Then the weighting 
procedure might introduce the correct number of children but if, on the average, too many 
(compared to the truth) two-child households were created relative to one- and three-child 
households, estimates of household income would be biased. 

Our procedure tends to fit weights that make the ‘‘adjusted-in’’ households similar in com- 
position to those that are common in the enumeration. However, the adjustment is described 
only by adjustment class totals, which do not carry detailed information on the composition 
of the omitted households. Thus, if certain household compositions are disproportionately 
undercounted they may be underrepresented in the weighted lists, and if these compositions 
are associated, for example, with lower incomes, the total income estimates will be biased 
upwards. 

This is essentially a problem of potential lack of fit of the model used in adjustment to the 
patterns in the data. The most severe biases might appear in statistics that refer specifically 
to household composition, such as the number of single-parent families. 

If composition bias were found to be a serious problem, one approach to controlling it would 
be to augment the class adjustment rates with additional information that describes the joint 
omissions of persons from different classes (or grouped classes). 


8.2 Response Bias 


It is not unreasonable to think that, of a group of households with the same composition, 
those which are missed in the census will differ systematically in some characteristics from those 
that are enumerated. In other words, omission may be a form of nonignorable nonresponse. 
For example, households with lower incomes and educational levels may be more likely to be 
missed altogether, or to omit some members from their roster; income and education are not 
classification variables and therefore are not directly adjusted. 

Whole-household adjustments are represented in the proposed methods by upweighting 
households, preserving the values of all covariates. The implicit assumption is that the omit- 
ted households do not differ on these covariates from enumerated households with similar com- 
position. There is no information available in the block being adjusted to contradict this 
assumption. However, it should be possible to collect information in the PES on the differences 
between enumerated and missed households, which could be incorporated into the adjustment. 
For example, the income of wholly omitted households might be related to the mean income 
of enumerated households with the same composition by a linear regression; then the added 
(weighted-in) households could be imputed the income obtained by applying the linear regres- 
sion function to the income of the enumerated donor household. Little and Rubin (1987) discuss 
relevant methods for missing data problems with informative nonresponse. Another approach 
that is integrated with the weighting adjustment methodology is described in the next section. 

Within-household adjustments are represented by downweighting a household with certain 
enumerated characteristics and upweighting another household that contains an additional 
member or members. In the absence of further adjustment, the characteristics of the upweighted
household, rather than those of the enumerated household from which the weight was taken, 
will apply to the ‘‘weighted-in’’ component. 

This poses problems that cannot be resolved without collecting some data (from a subsam- 
ple of the PES). For example, if a child were omitted from the household roster, there is no 
reason to think this would lead to misreporting of income. If households with more children 
had a higher mean income than those with fewer children, then the weighting would tend to 
over-estimate mean incomes. 

If an adult were omitted from the roster, this might also mean that the same adult’s income 
(if any) would be left out of the reported household income. It is plausible that the mean 
unreported income in this situation would be positive but less than the mean income of the 
corresponding adults in households where all adult members appear on the roster. For a stereo- 
typical example, consider a family on public assistance that does not report an adult male 
member, whose income would otherwise be deducted from the assistance level, and whose 
residence is somewhat inconsistent. That member’s income is likely to be less than that of a 
permanently resident adult male in a family that does not depend on public assistance. Thus, 
neither the income of the enumerated household nor that of the ‘‘weighted-up’’ household 
would be an accurate imputation for the adjusted household. 

No direct correspondence is established between households that are down-weighted and 
those that receive additional weight. Thus an unadjusted income cannot be carried over directly 
from the enumerated household to the ‘‘weighted up’’ household. However, with some research 
comparing the incomes of enumerated and missed households, the incomes of down-weighted 
households could be used in adjusting incomes. For example, the mean household income of 
the reweighted block could be constrained to be equal to that of the block before adjustment. 


8.3 Weighting Adjustment of Non-classification Characteristics


Suppose that adjusted summary information (by block) is available on some characteristics 
of households other than counts of individuals by adjustment class. For example, we might 
have an adjusted estimate of mean income or of the proportion of single-parent families, 
possibly from a regression model. As long as the summary statistic can be represented as a 
weighted sum of covariate values for each household, then conformity to the desired adjusted 
value can be imposed by a linear constraint on weights which can be made part of the weighting 
adjustment methodology of this paper. Thus, in the income example, we would constrain the 
weighted sum of incomes to equal the product of the number of households and the adjusted 
mean income. To adjust the proportion of single-parent families, we would constrain the 
weighted sum of 0-1 indicators for that status to the desired total count. 
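
The way such a constraint enters the weighting problem can be sketched as follows. The helper name and all numerical values are hypothetical; the sketch assumes the class-count constraints are held as a matrix A (classes by households) with targets B, and simply appends one further row containing the covariate values.

```python
import numpy as np


def add_covariate_constraint(A, B, covariate, target_total):
    """Append one linear constraint, sum_i W_i * covariate_i = target_total."""
    A_aug = np.vstack([A, np.asarray(covariate, dtype=float)])
    B_aug = np.append(np.asarray(B, dtype=float), target_total)
    return A_aug, B_aug


# Hypothetical block: 2 adjustment classes, 3 households.
A = np.array([[1, 0, 2],
              [0, 1, 1]])
B = np.array([3.2, 2.1])
income = [20000.0, 30000.0, 50000.0]   # household incomes (illustrative)
H = 3.1                                # adjusted number of households
adjusted_mean_income = 34000.0         # e.g. from a regression model

A_aug, B_aug = add_covariate_constraint(A, B, income, H * adjusted_mean_income)
```

For the single-parent example, the appended row would hold 0-1 indicators and the target would be the desired count. The augmented system can then be passed to whatever routine fits the weights.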


8.4 Summary and Implications 


The methodology proposed will upweight households, and without further consideration 
of possible biases, will carry along the characteristics of the upweighted households. If the size 
of the adjustment and the biases introduced in household characteristics are both of small order, 
the overall bias in estimated block characteristics will be of second order and should not be 
a major problem. Some simple regression adjustments might make it possible to reduce the 
biases by an additional order of magnitude. 
9. SUGGESTIONS FOR FUTURE RESEARCH AND DEVELOPMENT
OF METHODOLOGY 


This section summarizes a number of suggestions for implementation and further develop- 
ment of this adjustment methodology. 


9.1 Post Enumeration Survey (PES) Data-gathering and Statistical Modeling 


Omissions of persons in enumerated and omitted households should be distinguished in the 
PES and the two omission rates modeled separately for each adjustment class. Rates of omis- 
sions of whole households should also be modeled (Section 4). A variety of measures (as in 
Section 6.2) could be used to compare the composition of ‘‘weighted-in’’ households to that 
of omitted households found in the PES; if research found that ‘‘composition bias’’ was a 
significant problem, higher-order statistics should be developed (Section 8.1). A sample of PES 
households that were omitted in the Census should be administered the long form, so that the 
relationship between omission and covariates such as income and education could be modeled 
for the adjustment (Sections 8.2, 8.3). 


9.2 Feasibility of Adjustments 


The methods of Section 5 should be tested and compared using PES data. 


9.3 Multiple Imputation 


Although the procedures proposed in this paper operate deterministically, there are a number 
of sources of uncertainty in statistics based on the weighted records. These include: uncertainty 
in estimation of undercount rates; variability in class undercount rates from block to block 
around the national mean; binomial variability in the actual number of omitted households 
or individuals around the expected number given the undercount rate; uncertainty regarding 
differences between covariate values for omitted households and for enumerated households 
that are weighted up to replace them. 

For research uses, files could be prepared that would represent all of these forms of uncer- 
tainty by multiple imputation (Rubin 1987). Two or more versions of the reweighted data set 
could be represented by including several sets of weights on the file. Researchers could repeat 
their analyses using each set of weights in turn. The variability among the statistics calculated 
on the different versions gives an estimate of the variability introduced by the process of under- 
count adjustment. Zaslavsky (1989) discusses procedures for multiple imputation in this setting. 
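
The repeated analysis described above might look like the following sketch, in which a weighted mean is recomputed under each set of weights. The function name and data are hypothetical, and only the between-version variance is computed; the full combining rules for multiple imputation also involve the within-version variance and are not reproduced here.

```python
import numpy as np


def weighted_mean_across_versions(values, weight_sets):
    """Weighted mean of `values` under each set of weights, plus their spread.

    weight_sets has one row per version of the reweighted file.
    Returns (estimates, their average, their between-version variance).
    """
    W = np.asarray(weight_sets, dtype=float)
    v = np.asarray(values, dtype=float)
    estimates = (W * v).sum(axis=1) / W.sum(axis=1)
    return estimates, estimates.mean(), estimates.var(ddof=1)


# Hypothetical file: three records, two alternative sets of weights.
values = [1.0, 2.0, 3.0]
weight_sets = [[1.0, 1.0, 1.0],
               [1.0, 1.0, 2.0]]
estimates, average, between_var = weighted_mean_across_versions(values, weight_sets)
```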


10. SUPPLEMENTS 


10.1 Choice of Objective Function for Weighting 


A number of objective functions have been proposed for calculating an optimal fitted table 
(usually in the context of contingency tables, cf. Fagan and Greenberg 1988). In each case the 
function takes the form $T = \sum_i T_1(W_i)$, where $T_1$ takes one of the forms displayed in Table 9.
Each of these functions can be standardized to an equivalent function $T_0$ by multiplication
by a constant coefficient and adding a linear function of $W$, so that $T_0(1) = 0$, $T_0'(1) = 0$,
$T_0''(1) = 1$. Since $\sum_i W_i$ is constrained to a given value, the optimum weights will be
unaffected. Then the standardized objective functions agree through the second term of their
Taylor expansions about 1, and should give similar results when the weights are close to 1.
Table 9
Comparison of objective functions for table fitting

Name of fitting          Objective function      Objective function          $T_0''(W)$
procedure                $T_1(W)$, usual form    $T_0(W)$, standardized
                                                 form

Least squares            $(W - 1)^2$             $(W - 1)^2/2$               $1$
(minimum variance)
Raking                   $W \log W$              $W \log W - W + 1$          $1/W$
Maximum likelihood       $-\log W$               $W - 1 - \log W$            $1/W^2$
Minimum $\chi^2$         $(W - 1)^2/W$           $(W - 1)^2/(2W)$            $1/W^3$


The objective functions differ in the degree of asymmetry between the costs of downweighting
and upweighting, determined by the exponent of $W$ in the second derivative, $T_0''(W) = W^{-k}$.
The least squares procedure ($k = 0$) treats up- and down-weighting completely symmetrically
and therefore may yield zero or negative weights. As $k$ increases, the cost of upweighting
becomes smaller relative to that of downweighting. All of the other objective functions ($k > 0$)
give every observation in the raw data a positive weight; in the case of the ‘‘raking’’ function,
this is obvious from the form of the weights as shown in Section 3.3. The use of the ‘‘raking’’
function here in preference to maximum likelihood or minimum $\chi^2$ is motivated by the simple
form of its solution and
by the analogy to raking in contingency tables. Cressie and Read (1984) systematically study 
the properties of this family of measures of fit. 
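
The standardization can be checked numerically. The following sketch, which is not part of the original analysis, evaluates the four standardized forms with finite differences; the dictionary names and the step size are illustrative choices. At W = 1 each function should have value 0, first derivative 0 and second derivative 1, and at any other W the second derivative should equal W to the power -k with k = 0, 1, 2, 3.

```python
import math

# Standardized objective functions, keyed by fitting procedure.
STANDARDIZED = {
    "least squares": lambda w: (w - 1) ** 2 / 2,
    "raking": lambda w: w * math.log(w) - w + 1,
    "maximum likelihood": lambda w: w - 1 - math.log(w),
    "minimum chi-square": lambda w: (w - 1) ** 2 / (2 * w),
}

# Exponent k in the second derivative W**(-k).
EXPONENT = {
    "least squares": 0,
    "raking": 1,
    "maximum likelihood": 2,
    "minimum chi-square": 3,
}


def derivatives_at(f, w, h=1e-4):
    """Value and central-difference first and second derivatives of f at w."""
    d1 = (f(w + h) - f(w - h)) / (2 * h)
    d2 = (f(w + h) - 2 * f(w) + f(w - h)) / h ** 2
    return f(w), d1, d2


for name, f in STANDARDIZED.items():
    value, d1, d2 = derivatives_at(f, 1.0)
    print(f"{name:20s} T(1)={value:.1e}  T'(1)={d1:.1e}  T''(1)={d2:.4f}")
```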


10.2 A Cyclic Descent Methodology for Fitting Weights 


In this section we present a fitting methodology analogous to iterative proportional fitting 
(IPF) in contingency tables. In IPF, the cell counts are transformed multiplicatively in such
a way that the cross-products are preserved (the condition for minimization of the objective 
function) while the table is made to conform to each set of marginal constraints in turn. The 
algorithm converges to a table that satisfies all of the constraints, and perforce preserves the 
cross-products as well (Bishop, Fienberg and Holland 1974; Ireland and Kullback 1968).

In our setting, the weights are required to have the log-linear form $W_i = \exp(a_i^{\top}\lambda - 1)$
derived in Section 3.3 while satisfying the constraints $AW = B$. In this exposition we will
assume that the total weight constraint $\sum_i W_i = H$ is omitted from $AW = B$, and that $A$ is
of dimension $p$ (constraints) $\times$ $I$ (number of household compositions). We will proceed
through a series of steps in each of which each weight $W_i$ is multiplied by $c\rho^{a_{ji}}$ to obtain a new
weight $W_i'$, thus preserving the log-linear structure; $c$ and $\rho$ are chosen so that the constraints
$\sum_i W_i' = H$ and $\sum_i W_i' a_{ji} = b_j$ are satisfied. By proceeding cyclically so that $j = 1, 2, \ldots, p$
indexes each constraint in turn, the algorithm eventually converges to weights that satisfy all
of the constraints.

On step $j$ of cycle $t$, the new weights are given by $W_i^{(t,j)} = c\rho^{a_{ji}} W_i^{(t,j-1)}$ (initialized for
$j = 1$ by using the last weights from the last cycle, $W_i^{(t,0)} = W_i^{(t-1,p)}$). Then $c$ and $\rho$ must
satisfy

$$\sum_i c\rho^{a_{ji}} W_i^{(t,j-1)} = H, \qquad \sum_i a_{ji}\, c\rho^{a_{ji}} W_i^{(t,j-1)} = b_j. \tag{5}$$

Eliminating $c$ from these equations, $\rho$ is a root of

$$\sum_i (H a_{ji} - b_j)\, W_i^{(t,j-1)} \rho^{a_{ji}} = 0. \tag{6}$$



We must have $H a_{j,\min} \le b_j \le H a_{j,\max}$, where $a_{j,\min}$ and $a_{j,\max}$ are respectively the minimum
and maximum values of $a_{ji}$. If this were not the case, constraint $j$ could not be satisfied with
any weights. Thus there must be at least one root $\rho$, and if the $a_{ji}$ are non-negative, the expression
is increasing in $\rho$ so this root is unique. The actual value of $\rho$ is determined then by Newton’s
method, or by a closed-form formula for the roots of a polynomial (since with the original
$A$, $a_{ji}$ is the number of class $j$ members in a household, which is an integer rarely exceeding 2).

While we have not yet proven that this algorithm always converges, we have found it to be 
successful in practice. This algorithm does not require any matrix inversion, and if the $a_{ji}$ are
small integers, then at each step, the recalculation of the weights involves calculating only a 
few integral powers. Furthermore, if some constraints take the form of simple marginals, the 
adjustment for those constraints takes the form of a conventional raking step. 

If the original constraint matrix $A$ is used, the procedure may take advantage of the
sparseness of $A$ (which is a consequence of the fact that only a few classes are represented in
each household). At each step (say, adjusting to fit margin $b_j$), only the weights corresponding
to non-zero $a_{ji}$ need be modified; thus only $S_1$ multiplications (the number of nonzero
entries in $A$, which is less than the population of the block) and perhaps $3S_1$ additions are
required per cycle, as compared to $5S_1 + S_2$ operations per iteration with Newton’s method.
On the other hand, the rows of $A$ tend to be highly dependent, so convergence may be slow
(typically 20 cycles in our simulations); orthogonalization of $A$ destroys the sparse structure
of the coefficients. Thus, unless $S_2$ is much larger than $S_1$ (or unless some other method is
devised to accelerate the algorithm), raking is not faster than Newton’s method.
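
The cyclic procedure of this section can be sketched in Python as follows. This is an illustrative reading of the algorithm, not the authors' program: the function names, the toy constraint matrix and the starting weights are hypothetical, A is stored densely, and the root of the one-dimensional equation is found by bisection rather than by Newton's method or a closed-form polynomial root.

```python
import numpy as np


def solve_rho(a, w, H, b):
    """Root rho > 0 of sum_i (H*a_i - b) * w_i * rho**a_i = 0.

    Bisection is applied to the mean of a under weights w * rho**a,
    which is non-decreasing in rho and must equal b / H at the root.
    """
    target = b / H
    if not a.min() < target < a.max():
        raise ValueError("constraint cannot be met with positive weights")

    def gap(rho):
        t = w * rho ** a
        return (a * t).sum() / t.sum() - target

    lo = hi = 1.0
    while gap(lo) > 0:       # lower the bracket until gap <= 0
        lo /= 2.0
    while gap(hi) < 0:       # raise the bracket until gap >= 0
        hi *= 2.0
    for _ in range(100):     # plain bisection
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def cyclic_descent(A, B, H, w0, max_cycles=500, tol=1e-10):
    """Cycle through the rows of A, rescaling weights by c * rho**a_ji."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(max_cycles):
        for j in range(A.shape[0]):
            rho = solve_rho(A[j], w, H, B[j])
            w = w * rho ** A[j]
            w *= H / w.sum()             # c restores the total weight H
        if np.max(np.abs(A @ w - B)) < tol:
            break
    return w


# Toy block: 2 classes, 4 household compositions.
A = np.array([[1, 0, 1, 2],
              [0, 1, 1, 0]])
reference = np.array([1.10, 1.05, 1.20, 1.00])  # used only to build feasible totals
B = A @ reference
H = reference.sum()
w = cyclic_descent(A, B, H, np.ones(4))
```

The fitted weights need not equal the reference weights used to construct the totals; they are only required to reproduce the class totals and the household total while remaining positive multiples of the starting weights.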
QUID, A General Automatic Coding Method


JACQUES LORIGNY¹


ABSTRACT 


The QUID system, designed and developed by the Institut National de la Statistique et des Etudes
Economiques (INSEE, the National Statistics and Economic Studies Institute, Paris), is an automatic coding
system for survey data collected in the form of literal headings expressed in the terminology of the respon- 
dent. The system hinges on the use of a very wide knowledge base made up of real phrases coded by 
experts. This study deals primarily with the preliminary automatic standardization processing of the 
phrases, and then with the algorithm used to organize the phrase base into an optimized tree pattern. 
A sorting example is provided in the form of an illustration. At present, the processing of additional 
coding variables used to complement the information contained in the phrases presents certain dif- 
ficulties, and these will be examined in detail. The QUID 2 project, an updated version of the system, 
will be discussed briefly. 


KEY WORDS: Automatic coding; Natural language variables; Phrase matching; N-grams. 


1. INTRODUCTION 


The QUID (abbreviation of QUestionnaires d’IDentification - Identification Questionnaires) 
system is an automatic coding system designed and developed by the Institut National de la 
Statistique et des Etudes Economiques (INSEE - National Statistics and Economic Studies 
Institute) in 1979-1980. 


Review of the Problem 

The problem consists of automatically classifying an individual surveyed into a job defined 
in accordance with an existing nomenclature (for example, the nomenclature of the profes- 
sions). In order to do this, the system uses mainly the natural language answer given in response 
to a direct question (for example, ‘‘What is your present profession or trade?’’), as well as 
additional information contained in the survey form, which is assumed to have been previously 
coded (for example, the Economic Activity code for the firm where the individual works). 

In our terminology, a direct answer in natural language is called the ‘‘literal heading’’, or 
simply ‘‘heading’’. Any additional encoded information is represented by the generic term 
‘‘additional variables’’.

In the next section, we will discuss the basic approach of the QUID system and the results 
of its implementation at INSEE. In section 3, we will describe the present version of the system. 
Finally, in section 4, we will examine the problems surrounding the processing of additional 
variables, and will discuss the new version of the system (QUID 2), which should help resolve 
the difficulties encountered. 


1 Jacques Lorigny, Administrateur à l’Institut National de la Statistique et des Etudes Economiques, 18, Bld Adolphe 
Pinard, 75675 PARIS CEDEX 14 (France). 


2. THE PRINCIPLE BEHIND THE METHOD 


2.1 The Basic Approach 


The basic approach of the QUID system consists of building a very large data base made 
up of typical respondent headings accompanied by a corresponding code assigned by an expert. 
The data base is as large as possible in order to make it possible to obtain a high matching rate, 
and new headings are added to the base as they appear. 

In our terminology, the data base is called a ‘‘knowledge base’’ or ‘‘knowledge file’’ (KF), 
because it has the ordinary structure of a flat file in its raw state. Most often, the knowledge 
file is set up on the basis of a survey carried out during a previous year, which has already been 
coded either manually or using an interactive method. Each base heading is accompanied by 
its code (which is a priori assumed to be accurate), and its ‘‘frequency of occurrence’’ in the 
KF; that is, the number of individuals who responded using this heading. 

The management task of the knowledge base (auditing, expansion) is completely separate 
from the operation of coding the survey under way. It is the responsibility of a central office 
staffed by expert coders, while the coding operation itself is most often regionally decentralized. 

The difficulties of an approach of this type derive from the rapid increase in the time required 
to search the base as it grows in size. In order to solve this problem, the QUID system uses 
mathematical results derived from Information Theory (Shannon 1948; C.-F. Picard 1972; 
B. Bouchon-Meunier 1978; M. Terrenoire 1970; D. Tounissoux 1980), which can be used to 
minimize search time by organizing the base in the form of an optimized tree structure. 

The basic approach of the QUID system also makes it possible to opt for a set of general 
programs; that is, those that can be used with all semantic fields, for example, professional, 
food products, or municipal headings. 


2.2 Results 


The system has been tried for various INSEE tasks and is presently being used to code the 
CS (socio-professional category) code in order to process DADS (Déclarations annuelles de 
données sociales - Annual social information) data provided by all firms that employ paid 
labour. The following figures provide an idea of the orders of magnitude involved. 

At present the knowledge file for the DADS application contains 122,000 headings (represen- 
ting a knowledge base population of 650,000 wage earners). Its optimized organization con- 
sists of a tree with about 100,000 nodes (of which 86,000 represent certainty nodes; see section 
3.2). It has been used to code a population of 570,000 wage earners with an average effectiveness 
of 90%, varying between 85% and 95%, depending upon the region. By ‘‘effectiveness’’, we 
mean the percentage of cases where the system provides a single answer which is accepted on 
principle under the conditions of this application. At present, since we do not have a precise 
measurement of the validity of these single answers, we estimate that the error rate is likely 
to be in the order of 5% to 10%. However, the knowledge base is being audited by the Dijon 
Expert Centre, and this audit should normally eliminate a significant proportion of these errors. 
Once this has been achieved, we will have more accurate figures to report. 

From the point of view of data processing limitations, the optimized tree is loaded into 3,300 
kilobytes of central (virtual) memory and the automatic coding time for an individual case is 
in the order of 40 ms in an IBM 4341 central processing unit. 

For the last few months, we have had available a variant of the coding program itself. This 
has been designed for use with mini-computers and can load the tree by sections, depending 
upon available memory space. 

In applications other than DADS data, effectiveness is not as high, no more than 75%. It 
all depends upon the quality and comprehensiveness of the knowledge base. 


3. THE PRESENT VERSION OF THE QUID SYSTEM (QUID 1) 


3.1 Preliminary Standardization of Headings 


Before constructing the optimized tree, the raw headings are first standardized in accor- 
dance with a set of external parameters chosen by the user for his application. 

The words are separated and fitted into predetermined zones whose length (the same for 
all words) and maximum number (the same for all headings) are parametrized. It is advisable 
to choose generous values for these two parameters, and to allow the optimization 
algorithm itself to select the significant elements of the heading (see section 3.2). For example, 
the DADS application (see section 2.2) chose 4 zones of 12 characters each. 

‘‘Empty words’’ are eliminated. The list of empty words is an external parameter provided 
by the user for his application. Most often, it includes articles, prepositions, etc., and is 
significantly dependent upon the application. 

Initials are standardized (I.N.S.E.E. becomes INSEE, S N C F becomes SNCF). 

Finally, the user may process the table of separate words in any way he wants (in the form 
of a subprogram in the PL/1 language). In fact, this is rarely necessary and seldom used (except 
to code municipal codes from municipal headings). 

Once word processing has been completed, the words are divided into bigrams (blocks of 
two consecutive letters) or trigrams (blocks of three consecutive letters), etc. Choosing the type 
of blocking is parametrized (however, a single parameter is used for the entire application). 
In practice, blocking into bigrams is the only type that has been used until now; however, the 
idea of blocking into trigrams should be tested. For the purposes of this study, we will only 
consider blocking into bigrams. 
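
The standardization steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the QUID programs themselves: the function name `standardize`, its parameter names, and the regular expression used to merge initials are all hypothetical, and the empty-word list is supplied by the caller as the text describes.

```python
import re

def standardize(heading, empty_words, n_zones=4, zone_len=12, block=2):
    """Turn one raw heading into its list of fixed-position letter blocks."""
    text = heading.upper()
    # Merge initials: "I.N.S.E.E." -> "INSEE", "S N C F" -> "SNCF".
    text = re.sub(r"\b[A-Z](?:[. ]+[A-Z]\b)+\.?",
                  lambda m: re.sub(r"[. ]", "", m.group()), text)
    # Separate the words and drop the user-supplied empty words.
    words = [w for w in re.findall(r"\w+", text) if w not in empty_words]
    # Fit each word into a zone of fixed length (truncate or pad with blanks).
    zones = [w[:zone_len].ljust(zone_len) for w in words[:n_zones]]
    zones += [" " * zone_len] * (n_zones - len(zones))
    flat = "".join(zones)
    # Block into bigrams (block=2) or trigrams (block=3).
    return [flat[i:i + block] for i in range(0, len(flat), block)]

bigrams = standardize("Vaccineur de volailles", empty_words={"DE"})
print(len(bigrams), bigrams[1])   # prints: 24 CC
```

With 4 zones of 12 characters, every heading yields 24 bigram positions, which matches the value m = 24 quoted in section 3.2.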


3.2 The Algorithm Used to Set Up the Optimized Tree Pattern 


Let $T = (t_1, t_2, \ldots, t_j, \ldots, t_n)$ be the code to be assigned, for example all the modalities of the 
Profession code. 

$Q = (q_1, q_2, \ldots, q_i, \ldots, q_m)$ is the set of all the bigrams resulting from the standardization of the 
headings (for example, $m = 24$ when the number 4 has been chosen as the 
‘‘number of words’’ parameter, and 12 characters as the ‘‘word length’’ 
parameter). 

$X$ is the tree pattern to be constructed, which we call a ‘‘QUID’’ (questionnaire d’identification - 
identification questionnaire). 
The algorithm constructs $X$ by working down from the root node $x_0$ (which by convention 
is placed at ‘‘level 0’’) to the nodes in levels 1, 2, etc. 
At root node $x_0$ it attaches the entire KF, and searches for the best bigram to query first; that 
is, the one which discriminates best for the desired code $T$ in the entire KF. 
$N(x_0)$ represents the total frequency of occurrence associated with the entire KF; that is, 
the sum of the frequencies accompanying the base headings. 
$N(x_0, j)$ is the frequency of occurrence of code $t_j$ in the entire KF. 
We assume that the knowledge population is statistically representative of the population 
to be coded (we should recall that, in practice, the KF is often the survey file for a previous year). 
Thus, we can estimate the probability of finding code $t_j$ in the population to be coded on the 
basis of the following formula: 

$$\Pr(t_j \mid x_0) = N(x_0, j) / N(x_0).$$

The a priori ambiguity for $T$ is measured on the basis of Shannon’s entropy: 

$$H(T \mid x_0) = \sum_j \Pr(t_j \mid x_0) \, \log \frac{1}{\Pr(t_j \mid x_0)}.$$
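
As a small numerical illustration of this ambiguity measure, the sketch below computes the entropy of a table of code frequencies. The function name `entropy` and the frequencies are hypothetical, and base-2 logarithms are an assumption (the text does not fix the base here, although section 4 later counts information in bits).

```python
from math import log2

def entropy(code_freq):
    """Shannon entropy, in bits, of a {code: frequency of occurrence} table."""
    total = sum(code_freq.values())
    return sum((n / total) * log2(total / n) for n in code_freq.values() if n > 0)

# Hypothetical frequencies N(x0, j) for three codes at the root node.
root = {"6916": 6, "4801": 1, "4615": 1}
print(round(entropy(root), 3))   # prints: 1.061
```

A node whose headings all carry the same code has entropy zero, which is the situation the algorithm is trying to reach.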


Let us assume that a bigram, for example $q_i$, is allocated to node $x_0$. To each of its modalities 
in the KF, we associate the sub-base made up of the headings that have this modality. 

Let $(a^1, a^2, \ldots, a^k, \ldots)$ represent the modalities taken by bigram $q_i$ in the KF. For each 
of these modalities, and thus for each of the sub-bases generated, we create a node $y$, which follows 
immediately after $x_0$ and is located at level 1 of the tree. 

The information provided by bigram $q_i$ (which is assumed to be assigned to root node $x_0$) 
is measured by the average reduction in the ambiguity of $T$ when we go from $x_0$ to one of the 
$y$ nodes. 


That is: 

$$I(x_0, T, q_i) = H(T \mid x_0) - \sum_{y \in \Gamma(x_0)} \Pr(y) \, H(T \mid y),$$

where 

$\Gamma(x_0)$ represents all the successor nodes $y$ at level 1 below node $x_0$; 

$H(T \mid y)$ is the conditional entropy of $T$ at node $y$ 
(same formula as above but replacing $x_0$ by $y$); 

$\Pr(y) = N(x_0, a^k) / N(x_0)$ if $a^k$ is the modality of bigram $q_i$ which generates node $y$, and 
$N(x_0, a^k)$ is the frequency of occurrence of modality $a^k$ of bigram $q_i$ in the entire 
KF. 


The algorithm carries out this calculation for all bigrams $q_1, q_2, \ldots, q_m$, because at 
root node $x_0$ they are all possible candidates for selection as the first bigram to be queried. 
The algorithm chooses the bigram which maximizes $I(x_0, T, q_i)$. For example, in the case 
of $q_{10}$, it effectively divides the base into as many sub-bases as there are modalities of bigram 
$q_{10}$ in the base. This creates the $y$ nodes that follow $x_0$ at level 1, and the construction 
of level 1 of $X$ is thus completed. 
For each sub-base obtained (thus, for each $y$ node), the algorithm carries out exactly the 
same operation as that which we have just described for root node $x_0$, and so on. 
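
The selection step can be sketched as follows. Everything here is illustrative: the record layout `(bigrams, code, frequency)`, the function names, and the four-record knowledge file are assumptions made for the example, and logarithms are taken to base 2.

```python
from collections import defaultdict
from math import log2

def code_counts(records):
    """Total frequency of occurrence per code in a (sub-)base."""
    counts = defaultdict(int)
    for _, code, freq in records:
        counts[code] += freq
    return dict(counts)

def entropy(counts):
    total = sum(counts.values())
    return sum(n / total * log2(total / n) for n in counts.values() if n)

def information(records, i):
    """Average reduction in the ambiguity of T when bigram position i is queried."""
    total = sum(freq for _, _, freq in records)
    groups = defaultdict(list)
    for rec in records:
        groups[rec[0][i]].append(rec)          # one sub-base per modality
    after = sum(sum(f for _, _, f in g) / total * entropy(code_counts(g))
                for g in groups.values())
    return entropy(code_counts(records)) - after

def best_bigram(records, candidates):
    # max() keeps the first candidate in case of a tie.
    return max(candidates, key=lambda i: information(records, i))

# Hypothetical knowledge file: two bigram positions, two codes.
kf = [(("VA", "CC"), "A", 3), (("RA", "CC"), "B", 1),
      (("VA", "LE"), "A", 2), (("RA", "LE"), "B", 2)]
print(best_bigram(kf, [0, 1]))   # prints: 0
```

In this toy file the first position separates the two codes completely, so it removes all of the ambiguity, whereas the second position removes very little.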
The process stops for a given node: 
(1) when there is only one heading at the node; in this case, the conditional entropy is zero; or 
(2) when there is only a restricted number of headings that differ in terms of the remaining 
bigrams, but which all have the same code; or 
(3) when there are two headings or more that have different codes but cannot be distinguished 
by the remaining bigrams. 
Cases (1) and (2) are known as ‘‘certainty nodes’’, and case (3) is known as the ‘‘uncertainty 
node’’. Together, they represent the ‘‘terminal nodes’’. 


[Figure 1: tree diagram not reproduced. It shows the standardized heading split into numbered bigrams, 
and a path of bigram queries (branch labels such as 2=AD and 2=EF) ending at a node that carries 
profession code 7622.] 

Raw Heading: head, maintenance team. 

Legend: 

- Query the content of bigram no. 2 in this node of the tree. 

- 2=EF: the content of bigram no. 2 is EF. 

- In this node of the tree, we may determine the profession code: its value is 7622 (1975 Trade 
Nomenclature). 

In this example, the raw heading is that of the profession entered by the individual surveyed. The objective of the 
system is to find the corresponding profession code in the 1975 Trade Nomenclature. 

Initially, we extract the first ten characters of the three most significant words. In this way, we obtain the standardized 
heading, which is then blocked into pairs of letters (these are called bigrams and are numbered from 1 to 
15). Then, we query the system. This operates in accordance with a chain of questions and answers optimized 
by a mathematical algorithm based on information theory. This calculation takes place during the course of a preliminary 
phase which determines the first bigram queried as a function of a given knowledge file, and then the following 
sequence of questions depending upon the answer obtained each time. At this point, the computer queries first 
bigram no. 12, which contains TR. At this stage, it ascertains that it can without ambiguity determine that this represents 
the Profession 7622 code (Technical Staff and Technicians). On the average, processing time takes a total of 
41 milliseconds of computer time in an IBM 370/148, and the amount of central memory used is 380 Kbytes. 

Figure 1. Example of Classification of a Heading in the Tree. 


The construction of the tree X continues from level to level until the KF has been exhausted. 
In fact, we have never gone beyond level 15, but there is no set limit for the system itself. An 
example of classification in the tree is shown in Figure 1. 
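
The recursive construction and its three stopping cases can be sketched as below. This is a minimal illustration, not the INSEE implementation: the nested-dictionary tree, the record layout `(bigrams, code, frequency)` and all function names are assumptions, and ties are resolved by keeping the first candidate.

```python
from collections import defaultdict
from math import log2

def code_counts(records):
    counts = defaultdict(int)
    for _, code, freq in records:
        counts[code] += freq
    return dict(counts)

def entropy(counts):
    total = sum(counts.values())
    return sum(n / total * log2(total / n) for n in counts.values() if n)

def split(records, i):
    """Group records into sub-bases by the content of bigram position i."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[0][i]].append(rec)
    return groups

def information(records, i):
    total = sum(freq for _, _, freq in records)
    after = sum(sum(f for _, _, f in g) / total * entropy(code_counts(g))
                for g in split(records, i).values())
    return entropy(code_counts(records)) - after

def build(records, positions):
    counts = code_counts(records)
    if len(counts) == 1:                      # cases (1) and (2): certainty node
        return {"kind": "certainty", "code": next(iter(counts))}
    usable = [i for i in positions if len(split(records, i)) > 1]
    if not usable:                            # case (3): uncertainty node
        return {"kind": "uncertainty", "codes": counts}
    best = max(usable, key=lambda i: information(records, i))
    rest = [i for i in usable if i != best]
    return {"kind": "query", "bigram": best,
            "children": {m: build(g, rest) for m, g in split(records, best).items()}}

# Hypothetical knowledge file with two bigram positions and two codes.
kf = [(("VA", "CC"), "A", 3), (("RA", "CC"), "B", 1),
      (("VA", "LE"), "A", 2), (("RA", "LE"), "B", 2)]
tree = build(kf, [0, 1])
print(tree["bigram"], sorted(tree["children"]))   # prints: 0 ['RA', 'VA']
```

Because each level removes one bigram position from the candidates, the depth of the tree is bounded by the number of positions, which is consistent with the remark that the system itself sets no separate limit.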


3.3 The Use of the Coding Itself 


In order to code a heading in the current survey, we start by standardizing the information 
in accordance with section 3.1. Then, the bigrams obtained are matched against those of the 
Quid loaded into the computer. The exploration leads to three possible results. 


3.3.1 Certainty Node 

The system provides a single code but this may well be wrong if the knowledge base is not 
comprehensive enough. For example, during one of our first tests in 1979, we obtained a cer- 
tainty node for level 1 on the basis of bigram 2 = CC, since the only heading obtained had 
been VACCINEUR VOLAILLES (POULTRY VACCINATOR). 

When we later had to code the heading RACCOMMODEUR VETEMENTS (GARMENT 
MENDER) the single code obtained was that representing agricultural service professions, and 
the error was obvious. 

Thus, we added to the system a control procedure based on single echoes. This process is 
known as ‘‘redundancy control’’ and consists of verifying, after the detection of a single echo, 
the content of the first three bigrams of each word. A single echo (obtained on the basis of 
the vector leading to a certainty node) is said to be non-ambiguous, when the cluster of headings 
in the certainty node contains at least one heading that has the same redundancy bigrams as 
those of the heading to be coded. Otherwise, the echo is said to be ambiguous, and consequently 
treated as an anomaly of the automatic system. Experience has shown that this arrangement 
tends to consolidate significantly the reliability of the system without appreciably overburdening 
the tables in memory or increasing processing time (even in large applications, the number of 
redundancy formulas per certainty node is, on the average, in the order of one, and rarely goes 
beyond ten). 

In order to be thorough, we should add that this redundancy control is not rigidly set once 
and for all. The user has two external parameters: the list of bigrams over which he intends 
to exercise control, and the (maximum) number of bigrams retained. In this way, he can keep 
in check the severity of the matching control, depending upon his objectives in terms of the 
quality and ‘‘effectiveness’’ of automatic coding. 
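
A simplified sketch of redundancy control follows, using the two headings from the 1979 test. It assumes a single parameter (the number of leading bigrams checked per word) in place of the two external parameters described above, and the function names are hypothetical.

```python
def redundancy_key(words, n_bigrams=3):
    """First n_bigrams bigrams of each word, used as a verification key."""
    return tuple(w[:2 * n_bigrams] for w in words)

def is_unambiguous(words_to_code, node_headings, n_bigrams=3):
    """True when some heading stored at the certainty node shares the key."""
    key = redundancy_key(words_to_code, n_bigrams)
    return any(redundancy_key(h, n_bigrams) == key for h in node_headings)

# The certainty node reached through bigram 2 = CC holds a single heading.
node = [["VACCINEUR", "VOLAILLES"]]
print(is_unambiguous(["RACCOMMODEUR", "VETEMENTS"], node))   # prints: False
print(is_unambiguous(["VACCINEUR", "VOLAILLES"], node))      # prints: True
```

The garment mender is therefore flagged as an ambiguous echo instead of silently receiving the agricultural code.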


3.3.2 The Uncertainty Node 

The system provides various possible codes (most often two codes), and displays their respec- 
tive frequencies of occurrence at the node under consideration. In this case, the officer who 
has the file of the survey being processed will then manually reject one of the two. 


3.3.3 The Case of an Unknown Response 

If, during the course of exploring the Quid, the modality sought is not found in the modalities 
captured by the bigram queried, the search will fail and this also represents a case of rejection 
that must be processed manually. 
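
The three possible results of the exploration can be sketched as a walk through a tree of nested dictionaries. The tree layout, the function name `lookup`, the label `AGRI_SERVICE` and the frequencies are all hypothetical; only code 7622 is taken from Figure 1, and code 6201 is invented for the example. Bigram positions are counted from 0 in the code.

```python
def lookup(tree, bigrams):
    """Walk the tree with the bigrams of one heading; return (outcome, payload)."""
    node = tree
    while node["kind"] == "query":
        child = node["children"].get(bigrams[node["bigram"]])
        if child is None:
            return ("unknown", None)       # modality absent from the KF: manual processing
        node = child
    if node["kind"] == "certainty":
        return ("certainty", node["code"])
    return ("uncertainty", node["codes"])  # candidate codes with their frequencies

# Hand-written toy tree that queries bigram no. 2 (index 1) first.
tree = {"kind": "query", "bigram": 1, "children": {
    "CC": {"kind": "certainty", "code": "AGRI_SERVICE"},
    "EF": {"kind": "uncertainty", "codes": {"7622": 5, "6201": 2}},
}}
print(lookup(tree, ["RA", "CC"]))   # prints: ('certainty', 'AGRI_SERVICE')
print(lookup(tree, ["XX", "ZZ"]))   # prints: ('unknown', None)
```

The first call reproduces the miscoding described in section 3.3.1, which is why a certainty result is still passed through redundancy control.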

New cases encountered during the course of processing will be stored in memory, centralized 
in the expert centre, verified, and then incorporated into the KF in order to produce a new 
expanded version of the Quid. 

At present, for purposes of convenience, the knowledge iteration takes place once a year, 
but nothing prevents it from being organized more often so that applications 
can progress faster, for example in the case of population surveys. 


4. THE PROBLEM OF PROCESSING ADDITIONAL VARIABLES 


In the present version, QUID 1, additional variables are simply structured into bigrams and 
processed in the same way as literal data. This leads to certain difficulties and problems that 
made it necessary to develop a new version, QUID 2, which operates in two stages: 

- in the first stage, QUID 1 is reserved for processing the literal heading and produces 
either the final code (when this is totally determined by the heading), or an internal 
code designating a rule or decision table that can be applied to the additional variables 
to complete the calculation; 

- in the second stage, the rules or decision tables complete the determination of the final code. 


Detailed Examination of the Difficulties Encountered 


At times, certain nomenclatures that are particularly complex, such as the PCS Code 
(Nomenclature of Professions and Socio-Professional Categories) call upon a combination 
of the literal heading and various additional variables. 

For example, the coding of the PCS code uses the Professional Category additional variable 
(which is abbreviated to CPF). The following is the question such as it appears in the 1982 
Population Census Individual Form: 

Indicate the professional category of your present job: 


- unskilled or semi-skilled labourer                              1 
- labourer, semi-skilled labourer (OS, O1, O2, O3, ...)           2 
- skilled labourer (P1, P2, P3, OP, ...)                          3 
- clerk                                                           4 
- technician, draftsman                                           5 
- foreman supervising workers or clerks                           6 
- foreman supervising other foremen or technicians                7 
- engineer or professional staff                                  8 


The additional question was made necessary by the fact that the heading alone is not always 
enough to classify the individual in accordance with PCS nomenclature. 
For example, a LUMBER COMPANY WORKER 


- must be classified into 6916 (lumber company or forestry worker) 
if his CPF is 1, 2, 3, or 4 

- and into 4801 (Managerial and supervisory staff of agricultural or lumber operations) 
if his CPF is 5, 6, 7, or 8. 
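
A rule of this kind is the sort of thing a second-stage decision table could hold. The sketch below is only an assumption about how such a table might look: the text does not describe the QUID 2 table format, and the names `DECISION_TABLES` and `final_code` are hypothetical.

```python
# Hypothetical decision table: heading -> list of (CPF values, PCS code).
DECISION_TABLES = {
    "LUMBER COMPANY WORKER": [({1, 2, 3, 4}, "6916"), ({5, 6, 7, 8}, "4801")],
}

def final_code(heading, cpf):
    """Apply the rule selected by the heading to the CPF additional variable."""
    for cpf_values, pcs in DECISION_TABLES[heading]:
        if cpf in cpf_values:
            return pcs
    return None  # CPF missing or invalid: leave the case for manual coding

print(final_code("LUMBER COMPANY WORKER", 7))   # prints: 4801
```

Unlike a knowledge file of observed cases, the table covers all eight CPF values from a single entry.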


The present system considers these additional variables as if they were literal data. They 
are placed at the end of the heading and structured into bigrams in the same way (for example, 
the CPF variable with the addition of a blank space is placed into the (m + 1)th bigram). How- 
ever, this solution is not satisfactory and leads to various errors: 


Error No. 1. When there is not enough information in the KF, this may lead to many cases 
of unknown responses. 

For example, if the KF has only one LUMBER COMPANY WORKER with a CPF = 2 
and another with a CPF = 7, the file will be unable to find a LUMBER COMPANY WORKER 
with a CPF other than 2 or 7 (that is, a priori in 6 cases out of 8). This error is made worse 
when the additional variable is very diluted, for example, in the case of the variable representing 
the Economic Activity of the undertaking (which is abbreviated as additional variable AE). 


Error No. 2. When there is not enough information in the KF, this may lead to miscodings. 

For example, if the KF has only one LUMBER COMPANY WORKER with a CPF = 2, 
the CPF bigram will not discriminate or appear in the search key, so that a LUMBER COMPANY 
WORKER with a CPF = 7 will be classified into PCS = 6916 instead of 4801. This 
is a case of miscoding. 
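
The mechanism can be checked against the information measure of section 3.2. As a sketch, assume that the sub-base $x$ reached through the literal bigrams holds this single reference, coded 6916 (the symbol $q_{\mathrm{CPF}}$ for the CPF bigram is introduced here only for the illustration):

$$\Pr(6916 \mid x) = 1, \qquad H(T \mid x) = 1 \cdot \log \frac{1}{1} = 0, \qquad I(x, T, q_{\mathrm{CPF}}) = 0 - 0 = 0.$$

No bigram can reduce an ambiguity that is already zero, so the node is a certainty node and the CPF value of the individual to be coded is never queried.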

In order to correct this defect in the present system, the only measure we can take is to apply 
the redundancy control to the additional variables (and thus obtain an ambiguous or ques- 
tionable case which is rejected or corrected manually, instead of allowing the error to remain 
undetected). However, here again, this is only a last resort. In fact, the additional variables 
lead to an unchecked expansion of the KF. Each KF reference has its own cross combination 
of modalities of additional variables, and it is not very likely that we would find the same com- 
bination for a new individual to be coded. Thus, this will lead to many uncertain cases and 
automatic coding rejections, which will reduce the practical benefits of mass exploitation. 

The two errors, no. 1 and no. 2, are related to the relative incompleteness of the KF. For 
example, it would be enough to enter into the KF eight LUMBER COMPANY WORKER titles 
and add in each case one of the possible CPF modalities (1 to 8), in order for the two errors 
to disappear. However, in the case of real applications, we find that the relative incompleteness 
of the KF decreases quite slowly, as it grows to reach its operating pace. Contrary to the lex- 
icographic space of literal headings, which tend to become dense rather quickly, the cross 
checked space of the additional variables remains a vast frontier for a long time, and goes very 
slowly from a density of occupation of 0 to a density of 1 (one individual). 


Error No. 3. There is a third category of errors that are not caused by the incompleteness 
of the KF but by the excessive sensitivity of the QUID in relation to errors inevitably contained 
in the file (and this always in relation to the additional variables). 

Let us take a simple example. Let us assume that the SENIOR SECRETARY heading must 
be coded PCS = 4615 (senior secretarial staff), regardless of the value of all the additional 
variables. Let us consider the following KF, in which an error has slipped by (for example, 
a wrongly assigned PCS code): 


Heading                                             CPF a.v.   AE a.v.   PCS Code 

Senior Secretary (fashion design, haute couture)    7          49 11     4615 
Senior Secretary (loan cooperative)                 7          83 43     4616 (error) 


Even though the AE additional variable should not be used to code the PCS code, the QUID 
algorithm uses it to separate the two certainty nodes: 

- one in favour of 4615, in view of bigram AE1 = 49; 

- and the other in favour of 4616, in view of bigram AE1 = 83. 

The result is that, during the coding stage itself, all senior secretaries belonging to economic 
sectors other than those starting with 49 or 83 will appear as ‘‘unknown cases’’. Moreover, those 
in all sectors starting with 83 will obviously produce errors. However, it is mainly the first 
phenomenon that interferes with accuracy, because it affects an area that is much larger than 
that affected by the initial error. 


Error No. 4. Finally, the present QUID algorithm is excessively rigid in terms of choosing 
the optimal question. Most often, this results in a simple inversion of the order of the ques- 
tions in the course of the search, in relation to the order that would have been preferred by 
the designer. Thus, the effect is secondary, since the final results are identical. However, this 
may also lead to more serious distortions. 

Let us take the following (partly fictitious) example. Let us assume that, according to the 
nomenclature, the SENIOR SECRETARY heading should be coded either PCS = 4615, as 
above, if the CPF additional variable equals 1 to 7, or PCS = 3726 (current managerial staff 
in other administrative business services) if CPF equals 8. 

Let us examine the KF containing the following two references: 


Heading             AE a.v.   CPF a.v.   PCS Code 

Senior Secretary    49 11     8          3726 
Senior Secretary    83 44     7          4615 


Thus, the two references are correctly coded. When the QUID algorithm arrives at a node 
where it has examined all the possible bigrams of the literal heading, it must now choose one 
bigram in the additional variables, in order to separate the two final results: PCS = 3726 and 
PCS = 4615. In this simple but not altogether unrealistic example, the three possible bigrams, 
AE1, AE2, and CPF, provide the same quantity of information (one bit). In our algorithm, 
the arbitrary convention is that in cases of equality, the program should choose the first question 
in the order in which the additional variables were presented in the form. In 
this example, the convention is misleading, since we would encounter the aberration discussed above 
(error no. 3). However, it is not possible to determine an order of additional variables that 
would prevent this type of error in all important cases. We can only seek an order of questions 
that will be statistically the least invalid, by groping our way on the basis of the order of con- 
ceptual splits, the negentropic capacity of each additional variable, etc. 
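
The ‘‘one bit’’ figure can be verified under two assumptions that the text does not state explicitly: each of the two references has a frequency of occurrence of 1, and logarithms are taken to base 2. At the node $x$ holding both references:

$$H(T \mid x) = \tfrac{1}{2} \log_2 2 + \tfrac{1}{2} \log_2 2 = 1 \text{ bit}.$$

Each of AE1 (49 or 83), AE2 (11 or 44) and CPF (8 or 7) splits the node into two single-reference sub-bases $y$ with $H(T \mid y) = 0$, so

$$I(x, T, q) = 1 - \left( \tfrac{1}{2} \cdot 0 + \tfrac{1}{2} \cdot 0 \right) = 1 \text{ bit} \quad \text{for } q \in \{\mathrm{AE1}, \mathrm{AE2}, \mathrm{CPF}\}.$$

The three candidates tie exactly, which is why only the arbitrary ordering convention decides between them.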


5. CONCLUSION 


In its QUID 1 version, the present QUID system provides very valuable services to INSEE. 
Nevertheless, it still has certain weak points regarding the processing of additional variables. 

The new QUID 2 version should improve processing while remaining faithful to our ‘‘basic 
approach’’ to the automatic coding problem, which could be summarized in two points: 

1. Separation of the knowledge base (in this case, a base of rules and decision tables that are 
written in natural language, are independent of each other, and are audited and managed 
by an autonomous expert centre), and the use of automatic coding programs (in this case, 
loading and table exploration programs). 

2. Construction of general programs; that is, programs that are independent of the semantic 
field processed. 


At least, these are the objectives that we try to attain. 


ACTR 
A Generalized Automated Coding System 


M.J. WENZOWSKI¹ 


ABSTRACT 


A generalized implementation of a method for performing automated coding is described. Traditionally, 
coding has been performed manually by specially trained personnel, but recently computerized systems 
have appeared which either eliminate or substantially reduce the need for manual coding. Typically, 
such systems are limited in use to those applications for which they were originally designed. The system 
presented here may be used by any application to perform coding of English or French text using any 
classification scheme. 


KEY WORDS: Automated coding; Classification; Text searching. 


1. INTRODUCTION 


Automated coding refers to the process by which text is machine analysed in order to assign 
it a classification, or code. To be practical, automated coding systems must be capable of 
coping with such problems as: rearranged words, plural vs singular forms, missing words, 
extraneous words, spelling variations, synonyms, abbreviations, inconsistent hyphenation 
and variable punctuation and syntax. In addition, in searching a text database for a match, 
they should be capable of determining the closest match when no identical match can be found. 

Generalized systems provide all of the features required, packaged within an easy to use, 
flexible, and efficient framework. To use a generalized system for a particular application, 
no development or conversion effort is needed to tailor it to the application specific 
requirements. As well, no application sponsored support for the maintenance of a generalized 
system is necessary, since the package is supported and maintained by a central agency. 

ACTR (an acronym for: Automated Coding by Text Recognition) employs techniques 
similar to those employed in other automated coding systems currently in production at 
Statistics Canada (Landry and Pidcock 1984), but is unique in that it has been generalized 
to allow it to be used by any application to assign codes based on the input of English or French 
text according to any classification scheme. 

The methods which ACTR uses to perform automated coding are based on techniques 
which were originally developed at the U.S. Bureau of the Census (Appel and Hellerman 1983). 
Basically stated, the method consists of searching through a collection of text previously 
associated with correct codes. If the subject text is successfully located, the associated code 
is returned and the process ends. Otherwise, the search continues, but uses an algorithm to 
locate the closest match, and subsequently assign its associated code. 
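
The two-step search (exact match first, closest match otherwise) can be sketched as below. The closeness score used here, the share of words two phrases have in common, is a stand-in chosen for the example; it is not the algorithm ACTR uses, which this section does not specify. The function names and the codes `C1` and `C2` are hypothetical.

```python
def words(text):
    """Order-insensitive representation of a phrase."""
    return frozenset(text.lower().split())

def assign_code(text, reference):
    """reference maps phrase -> code. Return (code, score) or (None, 0.0)."""
    target = words(text)
    index = {words(phrase): code for phrase, code in reference.items()}
    if target in index:                       # identical match: stop here
        return index[target], 1.0

    def score(candidate):                     # stand-in similarity: word overlap
        return len(candidate & target) / len(candidate | target)

    best = max(index, key=score, default=None)
    if best is None or score(best) == 0:
        return None, 0.0                      # nothing close enough to propose
    return index[best], score(best)

reference = {"computer programmer": "C1", "truck driver": "C2"}
print(assign_code("programmer computer", reference))   # prints: ('C1', 1.0)
print(assign_code("nurse", reference))                 # prints: (None, 0.0)
```

In practice a minimum score would be needed before a closest match is accepted; that threshold is left out of the sketch.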


1 M.J. Wenzowski, Research and General Systems, Statistics Canada, Room 2306, Main Building, Ottawa, Ontario, 
K1A 0T6. 




2. USING ACTR 


To use ACTR in an automated coding application users first need to define the text and 
associated codes which they intend to use as a standard for matching. While there are many 
sources for this information, the best is a set of text which is representative of the text which 
will most likely be encountered in a matching run. For a survey, this generally means the 
responses and manually assigned codes from a previously completed survey. Although great 
care should be taken to ensure that the correct codes have been assigned, the text should be 
left as is, complete with spelling, grammar and syntax errors, since in this form it is most 
representative of the text which will be encountered in subsequent surveys. 

Once a file of text and correctly assigned codes has been defined, it must be loaded into 
a matching database. ACTR provides the software required to perform this task and so 
automatically transforms the file into a matching database. 

ACTR has been designed to allow an iterative approach to developing an automated coding 
application. Accordingly, text and codes can be added, changed or deleted at any time during 
the life of the application. In addition, the parsing strategy (discussed in detail below) can be 
altered at any time. Thus, users are presented with a software framework which, through cycles 
of database updates and matching runs, will allow for as many iterations as is necessary to 
obtain the matching quality desired. Users are encouraged to use ACTR in this manner, since 
ultimately it leads to higher quality and more economical coding operations. 


3. PRINCIPLES OF OPERATION 


In the case of a human being performing a coding operation, the similarity between occupations 
described as ‘‘Computer Programmer’’ and ‘‘Programming Computers’’ is so great that 
they would generally be judged as identical. However intuitive this reasoning may seem, computer 
systems in general would rate the two as unequal. Unfortunately, natural language (for 
example, English or French) frequently provides a large number of ways to express the same 
meaning. So, for a computer based system to be able to cope with this variance, there must 
be some means by which a degree of similarity can be determined. 

This is the essence of ACTR: text is rated according to how similar it is to some other text. 
In the preceding example, ACTR treats the two occupation descriptions as identical since, after 
suffixes are truncated, double letters are removed and word order is ignored, both phrases 
become ‘‘Comput Program’’ and as such are clearly equal. 

The steps employed in reducing the above phrases to a standard form are part of what is 
known in ACTR as the parsing strategy. ACTR’s parsing strategy is entirely user controlled 
and may be changed at any time during the life of an application. Users exercise control over 
the parsing strategy employed in their applications by supplying the data which is to be used 
to direct the process. This means that all steps are entirely controlled by the user, even to the 
extent of allowing a step to be skipped. 


The Parsing Strategy 

Parsing is the ACTR process which is responsible for the reduction of phrases to a standard 
form. Ideally, the resulting form should be such that any two phrases with the same words 
will be identical in their ACTR representation regardless of their syntactical and grammatical 
differences. Returning to the previous example, the two phrases ‘‘Computer Programmer’’ and 
‘‘Programming Computers’’, when properly parsed, should ideally result in a set of identical 
words for each phrase. For example, both phrases could be reduced to ‘‘Comput Program’’. 

The parsing process employed may involve the reduction of plural forms, elimination of 
trivial words, removal of suffixes and/or a number of other steps. Although the order of the 
parsing steps applied is fixed by ACTR, users control how, if at all, each step is executed. For 
further information on the order of parsing, the interested reader should consult Connor, 
Salloum and Wenzowski (1988). 

Basically, the parsing process can be thought of as having the following two major subcom- 
ponents: 


1. TEXT PROCESSING. In this stage of parsing, the text supplied is processed as a 
continuous stream of characters. Although one may think of the text as containing 
words, spaces and punctuation, none of these is given any special consideration at 
this point in the parse. This view is necessary in order to allow for the recognition of 
particular character strings exactly as they occur in situ. 


2. WORD PROCESSING. When this stage of the parse begins, the text has already been 
broken down into words and so further processing is performed on a word by word 
basis. This view is necessary since a large amount of text standardization occurs on 
the basis of defined words. 


Text Processing 

As already discussed, these steps are performed regardless of context. Thus, the following 
steps are performed on a character by character basis. 

Exclusion Clauses: Exclusion clauses are ignored in matching, but are used in database 
updating to indicate the intention of allowing controlled duplication of phrases. By default, 
ACTR will not allow identical phrases to be loaded into a matching database. 

By providing a means of controlling duplication, users are able to load phrases which could 
have more than one code assigned, even though they are identical after having been parsed. 
Although not used in matching, exclusion clauses are stored along with the phrase in the mat- 
ching database and can subsequently be used to manually resolve multiple matches. 

The syntax of an exclusion clause is defined entirely by the user. Both beginning and ter- 
minating strings must be provided. These and any information enclosed by them are ignored 
during matching. 

As an example, consider an exclusion clause syntax defined with a beginning string of 
‘‘(Except’’ and a terminating string of ‘‘)’’. With this in place, the two phrases ‘‘Computer 
Programming (Except As An Employee)’’ and ‘‘Computer Programming (Except As Self- 
Employed)’’ could co-exist in the matching database, even though their ACTR representations 
are identical. Subsequently, if a match for ‘‘Computer Programmer’’ is requested, both of 
these phrases would be returned. Since exclusion clauses are stored along with the original phrase 
text, they can be displayed to a reviewer, who could then manually resolve the match. 
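
The handling of exclusion clauses can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name split_exclusion is hypothetical, and the beginning and terminating strings are the ones from the example above.

```python
def split_exclusion(phrase, begin="(Except", end=")"):
    """Return (text used for matching, stored exclusion clause or None)."""
    start = phrase.find(begin)
    if start == -1:
        return " ".join(phrase.split()), None
    stop = phrase.find(end, start + len(begin))
    if stop == -1:
        # No terminating string: treat the phrase as ordinary text.
        return " ".join(phrase.split()), None
    clause = phrase[start:stop + len(end)]
    remainder = phrase[:start] + phrase[stop + len(end):]
    return " ".join(remainder.split()), clause


a = split_exclusion("Computer Programming (Except As An Employee)")
b = split_exclusion("Computer Programming (Except As Self-Employed)")
```

Both phrases give the same matching text, while the two stored clauses differ and remain available to a reviewer.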

Deletion Strings: If any deletion string supplied by the user is found in any position in a 
phrase, ACTR will remove it from consideration before continuing the parse. 

As an example, in English processing, this is a way in which the apostrophe can be removed. 
For example, the two phrases ‘‘Electrician’s Apprentice’’ and ‘‘Apprentice Electrician’’ would 
become identical with the removal of the apostrophe. 

Note that if this step were not performed, the apostrophe would most likely be used as a 
word delimiter. This would yield three words for the first phrase and two for the second, of 
which only one word would be common to both. 

Replacement Strings: This facility is most useful for standardizing abbreviations. This is 
desirable since abbreviations commonly include characters which, although useful to the 
abbreviation, would be viewed as word separators at a later stage in the parse. If this were 
allowed to happen, information loss would most likely occur. 

As an example, if the string ‘‘T.V.’’ was defined with a replacement value of ‘‘Television’’ 
then any occurrence of the original string would be translated to the replacement value before 
continuing the parse. 

Note that if this step were not performed, the result of parsing ‘‘T.V.’’ would most likely 
be the two letters ‘‘T’’ and ‘‘V’’. This is clearly undesirable, since the meaning of the abbrevia- 
tion has been completely lost. 

Word Characters: ACTR defines a word as any contiguous sequence of characters in a phrase 
which are all members of the set of characters contained in the word character list. Any 
characters not in this list will be used as word delimiters and will be dropped from further con- 
sideration. 

Typically, the set of word characters used contains all of the letters of the alphabet and all 
of the numeric characters. With this in place, a phrase of ‘‘Farmer/Fisherman’’ will result in 
two words, since ‘‘/’’ is not a word character and is therefore used as a word delimiter. 
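
The text processing steps described above can be sketched in Python as follows. This is a minimal sketch, not ACTR's implementation: the function name text_process, the default deletion and replacement tables, and the order in which the steps are applied are all assumptions made for illustration.

```python
import string

# Assumed word character list: all letters and digits.
WORD_CHARS = set(string.ascii_letters + string.digits)


def text_process(phrase, deletions=("'",), replacements=None, word_chars=WORD_CHARS):
    """Apply deletion strings, replacement strings, then split into words."""
    if replacements is None:
        replacements = {"T.V.": "Television"}
    for s in deletions:
        phrase = phrase.replace(s, "")
    for old, new in replacements.items():
        phrase = phrase.replace(old, new)
    words, current = [], []
    for ch in phrase:
        if ch in word_chars:
            current.append(ch)
        elif current:
            # Any character outside the list ends the current word.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words
```

With these defaults, the apostrophe is removed before word splitting, ‘‘T.V.’’ survives as a single word, and ‘‘/’’ acts as a word delimiter.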


Word Processing 

At this point, ACTR begins to treat the text as a collection of words. Thus, the following 
processing steps are applied on a word by word basis. 

Hyphenated Words: Any hyphenated words supplied are replaced by the substitute word(s) 
also provided. This feature is very useful in providing for the recognition of words and word 
groups which are inconsistently hyphenated. 

As an example, if the user defines ‘‘Take-Out’’ as a hyphenated word with a substitute word 
of ‘‘Takeout’’ then this substitution will be made. If, on the other hand, this definition had 
not been made, then two words would result if the hyphen was not a word character. 

Illegal Word Characters: If any of the strings supplied are found to exist in any word in 
any position, then that entire word is removed from further consideration. 

As an example, some applications use this feature to eliminate words which contain numeric 
characters. So, if the set of numeric digits was given as illegal word characters, then a word 
like ‘‘DEPT716A’’ would be removed from further consideration. 

Replacement Words: This feature provides a synonym capability in order to ensure that two 
dissimilar words will be recognized as equivalent for matching purposes. This can also be 
useful to overcome commonly occurring spelling mistakes. 

As an example, if the phrases ‘‘Automobile Repairs’’ and ‘‘Car Repairs’’ were processed 
with the word ‘‘Car’’ given as a replacement word for ‘‘Automobile’’ then the two phrases 
would be made identical. 

Double Words: This feature forces ACTR to consider not only the occurrence of the two 
word grouping, but their order as well. This can be useful to overcome inconsistencies in word 
spellings and also to preserve word order. 

As an example, consider the phrase ‘‘Take Out Restaurant’’. Although this would yield three 
perfectly acceptable words, the words ‘‘Take’’ and ‘‘Out’’ would not match to either of 
‘‘Takeout’’ or ‘‘Take-Out’’. However, if a double word combination of ‘‘Take Out’’ was 
defined with a replacement of ‘‘Takeout’’ then the first case in the example given is addressed. 

We are presented here with an example of how steps in the parsing strategy can be used 
together. If the hyphenated word example given above was also entered, then all of the 
hyphenated, double word, and single word cases would match. 
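
How the hyphenated word and double word definitions work together can be sketched in Python as follows. The helper names and the use of simple whitespace splitting are assumptions for illustration only.

```python
def apply_hyphenated(text, hyphenated):
    """Replace each defined hyphenated word by its substitute."""
    for old, new in hyphenated.items():
        text = text.replace(old, new)
    return text


def apply_double_words(words, doubles):
    """Replace each defined pair of adjacent words by its substitute."""
    out, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in doubles:
            out.append(doubles[pair])
            i += 2
        else:
            out.append(words[i])
            i += 1
    return out


def normalise(text):
    text = apply_hyphenated(text, {"Take-Out": "Takeout"})
    return apply_double_words(text.split(), {("Take", "Out"): "Takeout"})
```

With both definitions in place, the hyphenated, double word and single word spellings all reduce to the same pair of words.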

Trivial Words: If any word in this set is encountered in the course of parsing, then it will 
be removed from further consideration. 

As an example, if the set of trivial words contained ‘‘A’’, ‘‘Am’’ and ‘‘I’’, and the two 
phrases ‘‘I Am A Computer Programmer’’ and ‘‘Computer Programmer’’ were encountered, 
then the phrases would match. 

Suffixes: At this point, words are scanned right to left looking for the longest defined suffix 
such that the remaining word, after the suffix is removed, will be at least five characters in 
length. If a defined suffix is found, it is removed. 

As an example, if the suffixes ‘‘ing’’ and ‘‘er’’ are defined, then the phrases ‘‘Computer 
Programming’’ and ‘‘Computer Programmer’’ will match. 
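
A minimal Python sketch of suffix truncation, assuming the rule stated above (longest defined suffix, at least five characters remaining); the function name strip_suffix is hypothetical.

```python
def strip_suffix(word, suffixes, min_stem=5):
    """Remove the longest defined suffix that leaves at least min_stem characters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if suffix and word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


SUFFIXES = {"ing", "er"}
first = [strip_suffix(w, SUFFIXES) for w in ["Computer", "Programming"]]
second = [strip_suffix(w, SUFFIXES) for w in ["Computer", "Programmer"]]
```

Both phrases reduce to the words ‘‘Comput’’ and ‘‘Programm’’, so they match; a short word such as ‘‘Baker’’ is left intact because fewer than five characters would remain.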

Replacement Suffixes: Replacement suffixes are searched for in a word by scanning right 
to left for the presence of the longest defined replacement suffix. If one is found, it is removed 
and the substitute supplied is used in its place. 

As an example, the user may wish a plural form to be reduced to a singular one so that the 
singular suffix will be recognized in the suffix truncation step. This is demonstrated with the 
phrases ‘‘Battery Manufacturing’’ and ‘‘Manufacturing Batteries’’. If the suffix ‘‘ies’’ is 
changed to ‘‘y’’ then not only will the phrases be the same, they will be processed in the same 
manner at suffix truncation time. 
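
Replacement suffixes can be sketched in the same way. The function name replace_suffix and the table passed to it are illustrative assumptions.

```python
def replace_suffix(word, replacements):
    """Swap the longest defined replacement suffix for its substitute."""
    for suffix in sorted(replacements, key=len, reverse=True):
        if suffix and word.endswith(suffix):
            return word[:-len(suffix)] + replacements[suffix]
    return word


singular = replace_suffix("Batteries", {"ies": "y"})
```

Here ‘‘Batteries’’ becomes ‘‘Battery’’, so the two phrases in the example contain the same words.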

Double Letters: At this stage in the parse, each word is examined for the presence of any 
double character occurrences which are contained in the (user-defined) double letter set. If any 
are found, they are reduced to a single occurrence. 

Typically, the double letter set used is the full set of alphabetic characters. If this is the case, 
then the words ‘‘Programer’’ and ‘‘Programmer’’ would match, in spite of the spelling error. 
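
A short Python sketch of double letter reduction, assuming the double letter set is the full alphabet. The function name is hypothetical, and this version also collapses longer runs of the same letter to a single occurrence.

```python
import string


def collapse_doubles(word, double_letters=frozenset(string.ascii_lowercase)):
    """Reduce repeated adjacent letters from the set to a single occurrence."""
    out = []
    for ch in word:
        if out and ch.lower() == out[-1].lower() and ch.lower() in double_letters:
            continue
        out.append(ch)
    return "".join(out)
```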

Root Words: At this point, words are scanned for the presence of any of the root words 
supplied. The scan is applied from left to right in the word, and searches for the longest defined 
matching root word. If one is found, then its substitute is used as a replacement for the word 
and the suffix truncation and replacement steps are skipped. 

As an example, the languages ‘‘Slavee’’ and ‘‘Slavic’’ differ only in their last two characters. 
So, if the suffixes defined include ‘‘ee’’ and ‘‘ic’’ then an information loss occurs, since both 
words will become identical. Although generally, suffix truncation works well for most applica- 
tions, it quite clearly fails for this particular example. To overcome this problem, if root words 
of ‘‘Slave’’ and ‘‘Slavi’’ are defined, then the suffix truncation step is bypassed for these cases 
only. Thus, as suffix truncation problem cases are identified, root words and their substitutes 
can be defined to overcome them. 
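
The interaction between root words and suffix truncation can be sketched in Python as follows. All names are hypothetical, and the substitutes chosen for the two root words are assumptions, since they are not given above. The minimum stem length is a parameter; it is set to 4 in the usage lines so that the two-letter suffixes are actually removed from these six-letter words.

```python
def reduce_word(word, roots, suffixes, min_stem=5):
    """Use a root word substitute if one matches; otherwise truncate a suffix."""
    # Longest root word matching the start of the word wins.
    for root in sorted(roots, key=len, reverse=True):
        if word.startswith(root):
            return roots[root]          # suffix steps are skipped
    for suffix in sorted(suffixes, key=len, reverse=True):
        if suffix and word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


SUFFIXES = {"ee", "ic"}
ROOTS = {"Slave": "Slave", "Slavi": "Slavi"}   # assumed substitutes

without_roots = [reduce_word(w, {}, SUFFIXES, min_stem=4) for w in ["Slavee", "Slavic"]]
with_roots = [reduce_word(w, ROOTS, SUFFIXES, min_stem=4) for w in ["Slavee", "Slavic"]]
```

Without root words both languages collapse to the same stem; with them, the two words remain distinct.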

Duplicate Words: Finally, the set of words resulting from the parse of the supplied text is 
examined for the presence of duplicates. 

Note that words which are duplicates at this point may not have appeared as duplicates before 
the text was parsed. Only one occurrence of each word defined at this point in the parse is kept. 


4. SEARCHING AND MATCHING METHODS 


ACTR always processes the supplied text according to the parsing strategy defined before 
attempting a match. If after doing this, ACTR is able to locate a phrase on the matching 
database with all of its words in common with all of the words in the supplied text, then the 
match found is referred to as a ‘‘Direct Match’’. If a direct match cannot be found, ACTR 
may, as a user option, continue to search the database for the closest match. This latter type 
of match is called an ‘‘Indirect Match’’. Although they share a common foundation in that 
they are both based on parsed text, the two matching methods used by ACTR differ greatly 
in their mechanisms for both locating and assigning a match. 


Direct Matching 

In direct matching, only a 100% match is searched for. Recall that matching is based on 
parsed text, so phrases which are 100% matches may not appear to be identical in their original 
form. This is a direct effect of the parsing strategy in use. 

In terms of database access techniques, the fastest path to an item is through the use of a 
key. Unfortunately, the roadblocks to keyed access of ACTR phrases exactly as they occur 
include a maximum phrase length of 200 characters and an upper limit of 20 on the number 
of parsed words. These two items make keyed access impractical since the extreme length of 
the key would negate any benefit derived. The only alternative to keyed access is sequential 
access, but this is undesirable because of the time required to search through the large volumes 
of information generally contained in a matching database. 

So, we are presented with no other alternative but to somehow reduce the size of the key, 
thus making keyed access viable. There are many well known data compression techniques 
which could be used to do this, a general survey of which can be found in Reghbati (1981). 
In ACTR, the required data compression is achieved by forming the ‘‘compressed phrase key’’ 
or CPK. How CPK’s are actually formed is discussed below. Accept for now that CPK for- 
mation results in a key which is approximately 35% of the original size of the phrase. The CPK 
can thus be used to access the matching database with an efficiently sized key in order to deter- 
mine whether any direct matches exist. 


The use of the CPK in ACTR is significant in the following ways: 
1. All 100% matches will always be located using this method. 


2. Since ACTR is able to locate direct matches by using the most efficient means 
possible, matches made by using this method are both faster and cheaper to 
perform. 


3. As applications mature, the proportion of direct matches generally increases due to 
ongoing database update activity on the part of the user. Thus, overall 
matching costs for an application can actually decrease as the application 
matures, even though the size of the matching database may increase. 


CPK Formation 


The CPK is formed by first ordering the words defined in parsing. The actual order is 
arbitrarily chosen and so is not significant, as long as the same ordering applies for all CPK 
formations. (The order used happens to be in ascending order of the collating sequence in use.) 

After ordering, the words are concatenated into a single string which contains no blanks. 
This string is then compressed in order to form a short enough string to allow for efficient use 
as a database retrieval key. The compression of the string is based on the following: 


1. The words resulting from parsing generally contain only characters from the 26 alphabetic 
character set and the 10 character numeric set. (Recall that the actual set of characters which 
may be encountered in words is user-defined.) However, characters are stored internally 
(i.e., in memory and on disk) using an 8 bit code. Thus, there are 2^8 or 256 possible 8 bit 
code combinations while ACTR words typically use no more than 36 of these. This leaves 
a 220 code surplus which could be used for other purposes. 


2. Certain double and triple letter combinations are known to occur more frequently than 
others in English and French text samples. In ACTR, the double letter combinations are 
known as ‘‘digrams’’, and the triple letter combinations are known as ‘‘trigrams’’. 


3. The 220 ‘‘free’’ codes can then be used to replace the digrams and trigrams described above 
as they occur in text samples. 


4. Starting with the concatenated, parsed words, ACTR scans for the presence of any of the 
predefined digrams and trigrams. If any are found, they are replaced with the associated 
8 bit code. The result is that a character sequence which formerly required 16 or 24 bits of 
storage, now requires only 8 bits. 
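
The four points above can be sketched in Python as follows. This is an illustration only: the n-gram table is hypothetical (the actual ACTR digrams and trigrams are not listed here), words are assumed to contain only letters and digits, and a simple left-to-right scan that prefers trigrams is assumed.

```python
import string

# The 36 codes used by parsed words (upper-case letters and digits).
WORD_CODES = set((string.ascii_uppercase + string.digits).encode("ascii"))
# The remaining 220 of the 256 possible 8 bit codes are "free".
FREE_CODES = [c for c in range(256) if c not in WORD_CODES]

# Hypothetical n-gram table, each entry assigned one free code.
NGRAMS = ["ING", "PRO", "ER", "OM", "UT", "RA"]
CODE_FOR = {ng: FREE_CODES[i] for i, ng in enumerate(NGRAMS)}


def make_cpk(words):
    """Order the parsed words, concatenate them and compress known n-grams."""
    text = "".join(sorted(w.upper() for w in words))
    out = bytearray()
    i = 0
    while i < len(text):
        for size in (3, 2):                      # prefer trigrams over digrams
            chunk = text[i:i + size]
            if len(chunk) == size and chunk in CODE_FOR:
                out.append(CODE_FOR[chunk])      # 16 or 24 bits become 8 bits
                i += size
                break
        else:
            out.append(ord(text[i]))             # ordinary character, kept as is
            i += 1
    return bytes(out)


key = make_cpk(["Program", "Comput"])
```

Because the words are ordered before concatenation, the same key results whatever the original word order, which is what allows keyed access for direct matching.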


Indirect Matching 


Like direct matching, indirect matching begins with the set of words resulting from the par- 
sing process. However, indirect matching can never be as efficient as direct matching since the 
concept of closest match is relative. That is, we cannot find the closest match without first per- 
forming an exhaustive search through all of the possible matches. 

In order to perform indirect matching, the matching database must first be searched for 
each of the words resulting from the parsing process in order to determine which, if any, are 
known. Following this step, for each word in the supplied phrase which is known to the 
database, all phrases containing the word must be retrieved and evaluated. 

The nearest matching phrase is determined by calculating a score for each of the possible 
matches. Scores are based on the weights of the words which are in common with the database 
and subject phrases. Of all database phrases evaluated in this manner, the highest scoring phrase 
is the one which is considered to be the closest match. 


Word Weight Calculation 

For each word known to the database, ACTR calculates a matching heuristic, or weight. 
These weights are an indication of the usefulness of a word in assigning a code and act as com- 
ponents in the phrase score calculation process. 

The method by which word weights are calculated is based on: $n$, a count of unique codes 
whose associated phrases contain this word; $V_i$, the relative frequency of code $i$ from 
previous surveys; $X_i$, a count of the number of word occurrences for phrases with code $i$; 
$P_i$, the proportion of this word in code $i$, calculated as 
$P_i = V_i X_i / \sum_{j=1}^{n} V_j X_j$; $EW$, the entropy of the word, calculated as 
$EW = -\sum_{i=1}^{n} P_i \log_2 P_i$; $K$, the total number of word occurrences over the 
$n$ codes, calculated as $K = \sum_{i=1}^{n} X_i$; $EU$, the entropy of a uniformly 
distributed variable with $K$ unique values, calculated as $EU = \log_2 K$; and finally $EO$, 
a small value to avoid division by zero, calculated as 
$EO = -\frac{K}{K+1} \log_2 \frac{K}{K+1}$. 

From the preceding, word weights are calculated as: 

$$\text{weight} = \frac{EU - EW + EO}{EO + EW}$$
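
The weight calculation can be sketched in Python as follows. The sketch assumes base-2 logarithms, takes K as the sum of the word occurrence counts over all codes containing the word, and reads the weight as (EU - EW + EO) / (EO + EW); the function and argument names are hypothetical.

```python
import math


def word_weight(counts, freqs):
    """counts[code] = X_i (occurrences of the word in phrases with that code);
    freqs[code] = V_i (relative frequency of that code). Counts must be positive."""
    vx = [freqs[code] * x for code, x in counts.items()]
    total = sum(vx)
    p = [v / total for v in vx if v > 0]
    ew = -sum(pi * math.log2(pi) for pi in p)          # entropy of the word
    k = sum(counts.values())                            # total word occurrences
    eu = math.log2(k)                                   # entropy of a uniform variable
    eo = -(k / (k + 1)) * math.log2(k / (k + 1))        # small value, avoids division by zero
    return (eu - ew + eo) / (eo + ew)


concentrated = word_weight({"2163": 4}, {"2163": 0.5})
spread = word_weight({"2163": 2, "4143": 2}, {"2163": 0.5, "4143": 0.5})
```

Under these assumptions, a word that occurs under a single code receives a much larger weight than one spread evenly over two codes, which agrees with the idea that the weight indicates how useful the word is in assigning a code. The code values shown are made up.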


Phrase Score Calculations 

For each database phrase which is evaluated for an indirect match, a score is calculated. 
The score is based on: $n$, the number of words the phrases have in common; $w_k$, the weight 
for word $k$; $m$, the number of words in the subject phrase; and $l$, the number of words in the 
database phrase. 


From the preceding, phrase scores are calculated as: 

$$\text{score} = \frac{n^2 \sum_{k=1}^{n} w_k}{m \times l}$$
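
A minimal Python sketch of the phrase score, assuming it takes the form of n squared times the sum of the common word weights, divided by m times l. The function name and the weights in the usage lines are hypothetical.

```python
def phrase_score(subject_words, db_words, weights):
    """Score a database phrase against a subject phrase (both already parsed)."""
    subject, database = set(subject_words), set(db_words)
    common = subject & database
    n, m, l = len(common), len(subject), len(database)
    return n ** 2 * sum(weights[w] for w in common) / (m * l)


weights = {"comput": 2.0, "program": 3.0, "senior": 1.0}
exact = phrase_score(["comput", "program"], ["program", "comput"], weights)
partial = phrase_score(["comput", "program"], ["senior", "comput", "program"], weights)
```

Under this assumed form, a database phrase with exactly the same words scores the sum of the word weights, and extra words in either phrase lower the score.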


Matching Parameters 

After calculating a score value for each potential match, ACTR compares the score against 
user supplied values for the following parameters and takes the action indicated. 


1. UPPER THRESHOLD 
If the resulting score is greater than or equal to this value, then a winner is considered 
to have been found. 


2. LOWER THRESHOLD 
If the resulting score is greater than or equal to this value, but less than that supplied 
for the upper threshold value, then a possible match is considered to have been found. 


3. PER CENT DIFFERENCE 
If more than one winner is found, and their scores are within the supplied value for 
this parameter, then multiple winners are considered to have been found. 
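
The use of the three parameters can be sketched in Python as follows. The function name, the outcome labels and the way the per cent difference is measured (relative to the best score, which is assumed to be positive) are assumptions for illustration.

```python
def classify(scored, upper, lower, pct_diff):
    """scored: list of (phrase, score) pairs. Returns (outcome, phrases)."""
    winners = sorted((s for s in scored if s[1] >= upper), key=lambda s: -s[1])
    if winners:
        best = winners[0][1]
        # Winners whose scores lie within pct_diff per cent of the best score.
        close = [p for p, s in winners if (best - s) / best * 100 <= pct_diff]
        return ("multiple winners" if len(close) > 1 else "winner", close)
    possible = [p for p, s in scored if lower <= s < upper]
    if possible:
        return "possible match", possible
    return "no match", []
```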


Limiting the Search for an Indirect Match 

ACTR searches the matching database for possible matches using the known words in the 
subject phrase. That is, these words are used to search for database phrases which contain them. 
The search proceeds in order of the ascending frequency of occurrence of the known words. 
Thus, the known word which occurs the least frequently in the database is used to start the 
search, the next lowest is used to continue the search, and so on. 

As can readily be appreciated, finding a match by the indirect process has the potential of 
being time consuming and very expensive. Unfortunately, attempts to find matches by indirect 
means are unavoidable since a nearest matching feature is an essential component of any 
automated coding system. 

While performing a search in this manner, ACTR maintains a list of database phrases which 
have already been evaluated. After a database phrase has been evaluated, it will not be re- 
evaluated in a subsequent iteration for the currently executing matching effort. This ensures 
that a database phrase which contains more than one of the known words will not be evaluated 
more than once. 

As a further search optimization, ACTR makes use of the user supplied matching 
parameters. With these, it constructs a table of optimistic scores for each iteration of the word 
based search: 


1. For the first known word, the optimistic score is based on the possible occurrence of a 
database phrase with the same number of words as the number of known words and with 
all of its words in common with the subject phrase’s known words. 


2. For the second word, a similar assumption is made, but since the first word has already 
been used in the preceding search iteration, we know that any phrase containing the first 
word has already been evaluated. So, the optimistic score is based on the presence of the 
second and subsequent words only. 


3. Optimistic scores for succeeding iterations are based on the presence of the current and suc- 
ceeding unsearched words only. 


The formula used to calculate the optimistic scores is based on: a, the number of known 
words in the subject phrase; b, the number of words in the subject phrase already searched; 
c, the total number of words in the subject phrase; and d, the number of known words not 
yet searched, calculated as $d = a - b$. 


From the preceding, optimistic phrase scores are calculated as: 

$$\text{optimistic score} = \frac{d \sum_{k=b+1}^{a} w_k}{c}$$

With the table of optimistic scores in place, ACTR evaluates the potential score at each itera- 
tion before performing a database access. Thus, hopeless searches are never attempted. 

To summarize, the search for an indirect match is terminated when any of the following 
conditions are met: 


1. The maximum potential score for the current iteration does not meet or exceed the threshold 
defined for possible matches. 


2. At least one match has been found and the maximum potential score for the current itera- 
tion cannot produce another. 


3. The maximum number of possible matches requested by the user has already been found 
and the maximum possible score for the current iteration does not exceed that of the lowest 
scoring phrase. 
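
The search loop can be sketched in Python as follows. This is a simplified illustration: only the first termination condition and the list of already evaluated phrases are modelled, and the index structure, the scoring function and the optimistic score function are supplied by the caller and are hypothetical.

```python
def indirect_search(known_words, index, score_fn, optimistic_fn, lower):
    """known_words: subject phrase words known to the database.
    index: word -> set of ids of database phrases containing the word.
    score_fn: phrase id -> score.
    optimistic_fn: list of unsearched words -> best score still achievable."""
    # Search in ascending order of frequency: rarest known word first.
    order = sorted(known_words, key=lambda w: len(index[w]))
    evaluated, results = set(), []
    for i, word in enumerate(order):
        if optimistic_fn(order[i:]) < lower:
            break                      # no remaining phrase can reach the threshold
        for pid in index[word] - evaluated:
            evaluated.add(pid)         # never evaluate the same phrase twice
            score = score_fn(pid)
            if score >= lower:
                results.append((pid, score))
    return sorted(results, key=lambda r: -r[1]), evaluated


index = {"comput": {1, 2, 3}, "program": {1, 2}}
scores = {1: 5.0, 2: 3.0, 3: 1.0}
found, seen = indirect_search(["comput", "program"], index, scores.get,
                              lambda rest: 2.5 * len(rest), lower=3.0)
```

In the usage lines the second iteration is skipped, because its optimistic score falls below the threshold, so phrase 3 is never retrieved.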


5. SUMMARY 


A flexible and efficient automated coding methodology, embedded in a generalized software 
system, has been presented. The system can be used to perform automated coding for any 
application in English or French or both, using any classification scheme. In doing so, it makes 
use of a powerful generalized parsing strategy and significant performance optimizations. For 
further information on ACTR, the interested reader is directed to Connor, Salloum and 
Wenzowski (1988). 


WALTER MUDRYK2 


ABSTRACT 


The methods used to control the quality of Statistics Canada’s survey processing operations generally 
involve acceptance sampling by attributes with rectifying inspection, contained within the broader 
framework of Acceptance Control. Although these methods are recognized as good corrective procedures, 
they do little in themselves to prevent errors from recurring. As this is of the utmost importance in any 
quality program, the Quality Control Processing System (QCPS) has been designed with error preven- 
tion as one of its primary focuses. Accordingly, the system produces feedback reports and graphs for 
operators, supervisors and managers involved in the various operations. The system also produces infor- 
mation concerning changes in the inspection environments which enable methodologists to adjust inspec- 
tion plans/procedures in accordance with the strategy of Acceptance Control. This paper highlights the 
main tabulation and estimation features of the QCPS and the manner in which it serves to support the 
principal quality control programs at Statistics Canada. Major capabilities from a methodological and 
systems perspective are discussed. 


KEY WORDS: Quality control processing system; Process control; Acceptance sampling; Acceptance 
control; Skip-lot sampling. 


1. INTRODUCTION 


This paper deals primarily with the features of the Quality Control Processing System 
(QCPS) that is presently being used at Statistics Canada. However, in order to show how this 
system fits into the overall quality picture for surveys, the paper begins with a brief discussion 
of the survey process and the role that quality assurance and quality control play in this process. 
The paper then identifies the specific quality control methods and strategies that are used for 
processing operations at Statistics Canada and how the QCPS serves to support this activity. 
The paper then proceeds to describe the system features and provides a summary of its major 
achievements. 


1.1 The Survey Process 


The requirement of ensuring quality in the overall survey process has always been consid- 
ered a high priority at Statistics Canada. In a very general sense, it may be viewed as being 
achieved through the application of a series of quality assurance (QA) and quality control (QC) 
measures at the appropriate stages of a survey process. It is important to distinguish between 
these two activities since in our environment, they involve very different approaches and pro- 
cedures that are normally applied at different points in the process. A simplified overview of 
the survey process at Statistics Canada includes the following stages: 


• planning 
• design 
• implementation 
• processing 
• publication. 


1 This is a revised version of the paper presented at the Fourth Annual Research Conference, Bureau of the Census, 
Arlington, Virginia, USA, March 1988. 

2 W.V. Mudryk, Business Survey Methods Division, Informatics and Methodology Branch, Statistics Canada, 10-J, 
Coats Building, Tunney’s Pasture, Ottawa, Ontario, Canada, K1A 0T6. 


It is important to note that every one of these stages is subject to some error. It should also 
be realized that the further into the survey process the errors are discovered, the more impact 
they have on survey timeliness, cost and accuracy. Therefore, it is good practice to put a strong 
emphasis early in the process, on the development of measures and procedures that would pre- 
vent or reduce their occurrence. This should occur at the planning and design stages of the 
survey process. These measures and procedures are also known as quality assurance. 


1.2 Quality Assurance 


A general approach to establishing quality assurance is to try to anticipate problems very 
early in the survey process and take appropriate steps to prevent or minimize them. The anticipa- 
tion can be based on experience, reviews, evaluations, debriefing exercises, feasibility studies, 
etc. The steps could include improving sampling frames/designs, modifying data collection 
methods, improving questionnaire design, providing clearer processing procedures, etc. A 
comprehensive list of such steps may be found in Statistics Canada’s Quality Guidelines (1987). 

This approach is extremely important since effectively it moves quality upstream and thereby 
helps to prevent many potential problems from occurring. Furthermore, in so doing, it assures 
better quality at the least cost by ‘‘getting it right the first time’’. Despite our best efforts how- 
ever, there are some situations when error levels continue to be unacceptably high. In these 
situations we consider the use of quality control. 


1.3 Quality Control 


In contrast with QA, statistical quality control has been found to be highly applicable at 
the processing stage of the survey cycle. At this stage, the work usually has the following 
characteristics: 


• labour intensive and repetitive in nature; 
• assigned to individuals or operators with varying abilities; 
• normally grouped into batches or lots of similar work units. 


As such, these survey operations are more prone to the occurrence of errors. Examples of 
these operations include: 


• coding/transcription 
• manual editing/reviewing 
• data capture/entry 
• corrections/reconciliation 
• updating/profiling, etc. 


For many reasons, which include complexity of tasks, abilities of operators, turnover of 
staff, etc., the amount and significance of error varies between operations, between operators 
within an operation, and at times within operator. Statistical quality control is used to iden- 
tify and reduce this variability and ensure that the outgoing quality of each operation falls within 
acceptable levels. 


2. QUALITY CONTROL STRATEGY 


2.1 Methods of Quality Control 


Of the two main methods of quality control available, namely, process control charts and 
acceptance sampling, we have found the latter methodology applied in the broader context 
of Acceptance Control, to be the more appropriate method for on-line quality control of survey 
processing operations. The reasons for this are as follows: 


• prior control or stability of process cannot be assumed initially nor always attained in the 
long run; 

• assignable causes of error are not always known since we are dealing with people (vs. say 
machines); 

• processes cannot readily be stopped and adjusted for assignable causes, even if they are 
known; 

• with many operators and large ‘‘between operator’’ variabilities, many individual control 
charts requiring immediate updating (i.e., after each sample observation) would be 
required on-line to the survey operation; this would be operationally difficult to achieve. 


Therefore our quality control strategy generally consists of using varying acceptance 
sampling procedures (with rectification) applied at the operator level, as a screening device 
for correcting substandard quality, with the aim of continually reducing inspection as the inspec- 
tion results support this action. This is coupled with an emphasis on operator and supervisor 
feedback to establish error prevention. In this manner both error correction and subsequent 
prevention are exercised at the error source, where they can have their greatest impact. Fur- 
thermore, between operator variations are automatically dealt with as each operator is effec- 
tively treated as a process in the following sense. During a period of low to moderate stability, 
acceptance sampling is applied to each lot processed. During a period of high stability coupled 
with good past inspection results, less acceptance sampling and even spot checking may be 
applied under the broader strategy of Acceptance Control. 
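
The screening step at the centre of this strategy, single acceptance sampling by attributes with rectification, can be sketched in Python as follows. This is a generic illustration rather than the QCPS implementation: the function name, the sample size and the acceptance number are hypothetical, and rectification is modelled as a complete inspection of any rejected lot.

```python
import random


def inspect_lot(lot, sample_size, acceptance_number, is_error, rng=random):
    """Return (accepted, units inspected, errors found) for one lot."""
    picks = rng.sample(range(len(lot)), min(sample_size, len(lot)))
    errors = sum(1 for i in picks if is_error(lot[i]))
    if errors <= acceptance_number:
        return True, len(picks), errors
    # Rejected lot: every unit is inspected so that all errors can be corrected.
    total_errors = sum(1 for unit in lot if is_error(unit))
    return False, len(lot), total_errors


good_lot = [0] * 50            # 0 = unit processed correctly
bad_lot = [1] * 50             # 1 = unit in error
accepted = inspect_lot(good_lot, sample_size=10, acceptance_number=1, is_error=bool)
rejected = inspect_lot(bad_lot, sample_size=10, acceptance_number=1, is_error=bool)
```

Applied separately to each operator's lots, a rule of this kind corrects substandard work at its source, and the sample size can be reduced as that operator's inspection results improve.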


2.2 Acceptance Control 


After a quality control program has been operating for some time, operator processing 
abilities tend to improve and in many cases, a stabilization of quality occurs. In an effort to 
take advantage of this improved situation and to enable our quality control designs to be more 
economical, we have adopted the strategy that Schilling (1982) calls Acceptance Control. Under 
this approach, acceptance sampling procedures are continually modified and adapted as changes 
in the inspection environment are identified. This is in accordance with one of QC’s main 
pioneers, H.F. Dodge, who states (1950): 


‘‘A good product with a history of consistently good quality requires less 
inspection than one with no history or a history of erratic quality. Accordingly, 
it is good practice to include in inspection procedures provisions for reducing or 
increasing the amount of inspection, depending on the character and quantity of 
evidence at hand regarding the level of quality and the degree of control shown.’’ 


In fact the ultimate aim of acceptance control is to continually reduce inspection to the level 
of spot checks or process controls as the quality history improves and stabilizes. At Statistics 
Canada, two specific approaches are used to achieve this principle: 


• Graduated Inspection Plans. These are obtained by raising or lowering the quality index for 
the sampling plan as changes in the process average are observed and then closely monitoring 
the impact on the resulting average outgoing quality estimates. 

• Cumulative Results Plans, more specifically Skip-Lot Sampling (Stephens 1982). Here, the 
extent of skipping lots depends on the stability and level of expected incoming quality. 


Both approaches are part of our acceptance control strategy and require a good quality 
history which would indicate not only the underlying level of processing quality (i.e., at the 
operator level) but also the extent of stability (i.e., degree of control) that can be expected in 
the process. Accordingly, the inspection process must provide: 


• good data (accurate error estimates);

• quick results (monthly, weekly, daily);

• incentive for improvement (feedback reports);

• quality history (time series of error quality).


Essentially these have been the motivating influences in developing the Quality Control Pro- 
cessing System (QCPS). It should be noted that changes are currently being made to the system 
to expand the existing operator quality history. This should provide the data to enable greater 
implementation of spot checks and/or process control for selected operators with exceptional 
and stable performances. 


3. SYSTEM DESCRIPTION 


Based on the strategy identified above, the QCPS has been developed to achieve the following 
objectives: 


• process any single acceptance sampling transaction;

• provide output by operator, where each operator can be treated as the error source;

• provide feedback to four levels of staff with current and historical quality control
information;

• support the acceptance control strategy by enabling the processing of skip-lot sampling results
and providing an extensive operator quality history;

• support the major QC objectives of error correction and prevention while enabling inspection
costs to continually be minimized.


3.1 Methodological Features 


a. Inspection Schemes 

The system can process any quality control transaction resulting from the application of 
single acceptance sampling. This naturally includes normal, reduced and tightened plans as 
well as any skipped lots resulting from skip-lot sampling. The system will also process any lot 
whose plan designation is 100% inspection. 


b. Lot Status Codes 

The system determines the treatment of incoming QC transactions by using lot status codes 
which indicate the state of completeness of the intended inspection. There are codes for the 
following lot situations: 


• sample inspected and accepted;

• sample inspected and rejected (remainder inspected);

• 100% inspected;

• any of the above not completed (3 codes);

• no sample inspection due to skip-lot.


c. Attributive Quality Measures

The system will produce estimates for various quality measures which include percent defec- 
tive, defects per hundred units and weighted error equivalents. For the latter quality measure, 
the system allows errors to be weighted according to a pre-defined error seriousness classification 
scheme. Typically, under these more complex measures, errors are categorized and assigned 
weights from 0 to 1 depending on their relative magnitude and seriousness. For purposes of 
simplicity, no more than four error categories are generally defined, as follows: 


Category         Weight
Critical         1.0
Major            0.4 - 0.6
Minor            0.2 - 0.3
Insignificant    0.0 - 0.1
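
As an illustration of the weighted error equivalent measure, the following Python sketch multiplies each error by the weight of its category and expresses the total per hundred inspected units. The weights for the Major, Minor and Insignificant categories are hypothetical values picked from within the ranges of the table, and the function names and the per-hundred scaling are assumptions made for the example, not the QCPS implementation.

```python
# Hypothetical weights: one value chosen inside each range of the table above.
ERROR_WEIGHTS = {
    "critical": 1.0,
    "major": 0.5,
    "minor": 0.25,
    "insignificant": 0.05,
}


def weighted_error_equivalents(error_counts, weights=ERROR_WEIGHTS):
    """Sum the errors found, each multiplied by the weight of its category."""
    return sum(weights[category] * count for category, count in error_counts.items())


def weighted_errors_per_hundred(error_counts, units_inspected, weights=ERROR_WEIGHTS):
    """Express the weighted error equivalents per hundred inspected units."""
    if units_inspected <= 0:
        raise ValueError("units_inspected must be positive")
    return 100.0 * weighted_error_equivalents(error_counts, weights) / units_inspected


# Hypothetical inspection result: error counts by category for 200 inspected units.
sample = {"critical": 2, "major": 4, "minor": 8, "insignificant": 20}
print(weighted_error_equivalents(sample))
print(weighted_errors_per_hundred(sample, 200))
```

Under such a scheme, a lot with many insignificant errors can score better than a lot with a few critical ones, which is the purpose of the seriousness classification.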


d. Estimates 

The system provides estimates and their associated standard errors (where applicable) for 
many key quality control indicators. The most important of these are: 

(i) Error Rates 

Error rates are calculated which relate to the individual operator, a specific sampling plan 
or the overall application. These estimates are provided for various time frames (e.g., daily, 
weekly, monthly, quarterly, etc.), and various subsets of the application, such as specific lot
categories (e.g., rejected lots) or sub-groups (e.g., regional offices). 
(ii) Operator Process Average 

An estimate of an operator’s processing ability at any particular point in time is provided 
by the operator process average. This estimate is calculated using an empirical Bayes approach 
(MacMillan and Mudryk 1988) which essentially shrinks the current operator sample error rate 
estimate part way towards the grand average error rate of the last four periods for that operator. 
The basis of shrinkage is determined by the ratio of the sampling variance of the current sample 
estimate to the total variance of the grand average estimate. This quantity has been found to 
produce good estimates for qualifying operators onto minimum inspection sampling plans. 
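
The shrinkage described above can be sketched as follows. The exact estimator is given in MacMillan and Mudryk (1988) and is not reproduced here; the Python function below is only a generic shrinkage form under the assumption that the shrinkage weight is the variance ratio mentioned in the text, capped to the interval [0, 1]. The function and parameter names are hypothetical.

```python
def operator_process_average(current_rate, sampling_variance, grand_average, total_variance):
    """Shrink the current sample error rate part way towards the grand average.

    Assumed form: weight = sampling_variance / total_variance, capped to [0, 1].
    A noisy current estimate (large sampling variance) is pulled more strongly
    towards the operator's grand average over the previous periods.
    """
    if total_variance <= 0:
        return current_rate  # no usable variance information: keep the sample estimate
    weight = min(1.0, max(0.0, sampling_variance / total_variance))
    return (1.0 - weight) * current_rate + weight * grand_average


# Hypothetical figures: current sample error rate 8%, grand average 4%.
print(operator_process_average(0.08, 0.0004, 0.04, 0.0016))
```

With these hypothetical inputs the weight is 0.25, so the estimate moves one quarter of the way from the current rate towards the grand average.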
(iii) Rejection Rates

Actual and expected rates of rejection are calculated for each sampling plan for purposes 
of statistical comparison and operational evaluation. The expected rates are obtained assuming 
Poisson probabilities. 
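
To make the Poisson assumption concrete, the sketch below computes the expected rejection rate of a single sampling plan with sample size n and acceptance number c when the incoming error rate is p. The plan parameters are hypothetical, and the function names are assumptions made for the example.

```python
import math


def poisson_acceptance_probability(sample_size, acceptance_number, error_rate):
    """P(number of errors in the sample <= acceptance number), Poisson with mean n*p."""
    mean = sample_size * error_rate
    return sum(
        math.exp(-mean) * mean ** d / math.factorial(d)
        for d in range(acceptance_number + 1)
    )


def expected_rejection_rate(sample_size, acceptance_number, error_rate):
    """Probability that a lot is rejected under the plan."""
    return 1.0 - poisson_acceptance_probability(sample_size, acceptance_number, error_rate)


# Hypothetical plan: sample 50 units, accept the lot on at most 1 error, incoming rate 2%.
print(round(expected_rejection_rate(50, 1, 0.02), 4))
```

Comparing this expected rate with the rejection rate actually observed for the plan gives the statistical comparison mentioned above.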
(iv) Inspection Rates 

Inspection rates are calculated at various levels as a general indicator of relative costs. These 
rates are determined with and without skip-lot effects on an actual and expected basis. The 
expected rates are a natural extension of the expected rejection rates discussed above. 
(v) Average Outgoing Quality 

An estimate is provided of the Average Outgoing Quality (AOQ) rate resulting from
the application of quality control to the operation. This estimate projects the observed error
rate at the operator level to the uninspected volume for that operator, and then aggregates all
operators to determine the overall estimate.
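
The AOQ projection can be sketched in a few lines of Python. The sketch assumes that errors found in inspected units are corrected, so that only the uninspected volume contributes residual errors, and that the overall rate is taken over the total volume processed; both points, like the function name, are assumptions made for the example.

```python
def average_outgoing_quality(operators):
    """operators: iterable of (error_rate, total_units, inspected_units) per operator.

    Each operator's observed error rate is projected onto that operator's
    uninspected volume, and the projected errors are then aggregated.
    """
    residual_errors = 0.0
    total_volume = 0
    for error_rate, total_units, inspected_units in operators:
        residual_errors += error_rate * (total_units - inspected_units)
        total_volume += total_units
    return residual_errors / total_volume if total_volume else 0.0


# Hypothetical operators: one on sampling inspection, one on 100% inspection.
print(average_outgoing_quality([(0.02, 1000, 200), (0.05, 500, 500)]))
```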


e. Analysis 

The system provides tabulations and outputs which enable analyses to be performed at 
various levels which help to subsequently fine tune the application parameters and/or modify 
the plans. These include: 


• operator profiles that enable a sampling plan/procedure qualification analysis;
• individual sampling plan evaluations that provide an overall QC plan analysis;
• summaries of key indicators that enable a QC cost-benefit analysis;
• a Pareto analysis of operator and error code contributions;
• group charts of operator process averages that provide an operations performance analysis.


f. Reports 

The system produces 8 reports and 5 graphical outputs (through its link to SASGRAPH) 
for each application run. Tabulations can also be produced for specified sub-groups (e.g., 
Statistics Canada’s regional offices) with a summarizing feature over all sub-groups of each 
report. 

Each set of output reports is designed for and disseminated to four levels of staff, namely: 
operator, supervisor, manager and QC designer. Examples of the output reports are available 
from the author. 


3.2 Software Features 


a. Operator Capacity 

For each application, the system can handle up to 108 operators in its historical file, each 
containing up to three previous periods of error information. A unique self-maintaining feature 
of this file is that any operator who has not been active during at least one of the last 4 con- 
secutive months of processing is dropped. This makes room for new operators on the file and 
thereby increases the effective file capacity. 


b. Historical Updates 

The system updates each operator error quality history (of up to 4 consecutive periods) with 
new information as it becomes available. This is currently being increased to 6 consecutive time 
periods. If an operator has not processed during a particular month, blank data for that month 
is inserted. Likewise, application year-to-date and quarterly totals are updated with the addi- 
tion of each new month of QC data. 
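
The self-maintaining behaviour of the historical file can be sketched as below: each operator keeps a rolling window of four periods, an inactive month is recorded as a blank entry, and an operator whose whole window is blank is dropped. The class and attribute names are hypothetical, and the file capacity limit is left out of the sketch.

```python
from collections import deque

HISTORY_LENGTH = 4  # consecutive periods kept per operator, as described above


class OperatorHistoryFile:
    """Rolling error-quality history per operator (illustrative sketch only)."""

    def __init__(self):
        self.histories = {}

    def update(self, period_results):
        """period_results maps operator id -> (errors, units) for the new month."""
        for operator in period_results:
            self.histories.setdefault(operator, deque(maxlen=HISTORY_LENGTH))
        for operator, history in list(self.histories.items()):
            # None stands for the blank data inserted for an inactive month.
            history.append(period_results.get(operator))
            if all(entry is None for entry in history):
                del self.histories[operator]  # inactive for 4 consecutive months


history_file = OperatorHistoryFile()
history_file.update({"A": (3, 100), "B": (1, 80)})
for _ in range(4):
    history_file.update({"A": (2, 100)})
print(sorted(history_file.histories))  # operator B has been dropped
```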


c. Year-End Rollover 

Most of the QCPS applications are maintained on a calendar year basis. When this option 
is specified, the system will zero out the previous monthly totals and commence a new application
time series (usually starting in January). The quarterly totals and the operator error quality
time series, however, are not reset at this time and continue to be maintained as usual.


d. Recovery 

If a tabulation run is made and errors are subsequently discovered, another run can be made 
using the recovery feature with the corrected data, to automatically produce the corrected 
outputs. 


4. SYSTEM BENEFITS 


The QCPS is aimed at servicing the needs of four levels of staff which interface with each 
QC application. Accordingly, the major achievements of this system can best be described under 
these same headings: 


a. Operator Level 

The QCPS provides extensive feedback to the individual processing operators on their cur- 
rent and historical performance. The operators are then able to track their own progress, com- 
pare their own performance with that of their peers, and identify explicitly where their errors 
are being made. The result of this feedback generally leads to:


• improvement in operator processing ability;
• increased motivation with respect to peers;
• greater quality consciousness;
• higher operator morale.


b. Supervisor Level 
The system provides operational information to the supervisors which enables them to better 
manage their operation in terms of: 


• effective resource allocation and work distribution;
• identifying problem operators and/or areas;
• determining training needs.


c. Management Level 
The system provides data summaries on key quality control indicators for management which
enable them to:


• receive an assurance of quality;
• track the progress of the application in terms of quality and costs;
• recommend changes to operational objectives.


d. QC Design Level 

The system provides extensive information (e.g., estimates, quality histories) which is used 
to analyze the quality control design and fine tune or enhance the methods and procedures 
of each application. When this data has been established and maintained over a sustained period 
of time, it can lead to: 


• improvements in QC methodologies and procedures;
• sampling plan and/or inspection procedure adjustments;
• minimization of inspection costs.


5. CONCLUSIONS 


The QCPS is being used at Statistics Canada to support the Quality Control programs of
many production-oriented survey processing operations. As the ultimate aim of each program
is to exercise error prevention to the extent possible, as well as to progressively reduce
inspection to the level of spot checks, a good and flexible processing system is essential. The QCPS
achieves these objectives by providing good data and quick results to the various levels of staff
that are involved in each operation, as well as supporting the various inspection methods that
fall under the general strategy of Acceptance Control.

The system is particularly attractive to our user community since it can easily handle large
volume operations involving many operators, quickly and at a low cost. Furthermore, by
treating each operator individually, the system focuses attention on each relevant error source
and supports this with necessary feedback to the appropriate levels of staff. In this manner
the system enables our quality control methods to be both preventive and corrective in an
efficient and economical manner.


Postal Address Analysis


YVES DeGUIRE¹


ABSTRACT 


When we examine postal addresses as they might appear in an administrative file, we discover a com- 
plex syntax, a lack of standards, various ambiguities and many errors. Therefore, postal addresses rep- 
resent a real challenge to any computer system using them. PAAS (Postal Address Analysis System) is 
currently under development at Statistics Canada and aims to replace an aging routine used throughout 
the Bureau to decode postal addresses. PAAS will provide a means by which computer applications will 
obtain the address components, the standardized version of these components and the corresponding 
Address Search Key (ASK). 


KEY WORDS: Postal addresses; Administrative data; Parsing; Standardization; Search key. 


1. INTRODUCTION 


Postal address analysis can be defined as the process of identifying the basic components 
of an address which appears in free format, standardizing those components, and generating 
an identifier for that address. This process can be used, for example, in the pre-processing 
step of any record linkage application that uses an address field or in the generation of a 
key for database access. Statistics Canada, as part of its 1991 census research program, is con- 
ducting a study on the implementation of a national Address Register. Such a register con- 
tains basically, postal address information. This information must be analyzed carefully in 
order to produce a register and to assess its quality. The Address Register Research Team has 
recognized that fact and research into the area of automated postal address analysis was 
initiated. 

This paper presents the results of this research on postal address analysis. The nature of 
an address and its related problems will be described. Also, some computer considerations will 
be discussed to explain why new software is needed for the Address Register and Statistics 
Canada. Finally, we will examine PAAS (Postal Address Analysis System); a system currently 
under development at Statistics Canada. 


2. POSTAL ADDRESSES: STATEMENT OF THE PROBLEM 


A postal address can be defined as a string of characters representing a location where an 
individual can pick up his mail. By location, we mean a physical place where the deliverer (like
a postman) and the receiver agree in the matter of mail reception. It can be a dwelling, a postal
box, a street or a rural route. To restrict our field of study, we are going to examine the addresses
that are Canadian (French and English), that represent residential locations and that should 
result in correct mail delivery. 


¹ Yves DeGuire, Research and General Systems, Statistics Canada, Room 2405, Main Building, Tunney’s Pasture,
Ottawa, Ontario, K1A 0T6.

As one would expect, the flexibility in the address definition results in problems for any
computerized application having to deal with postal addresses. Even a person is likely to 
encounter some problems with addresses with which he/she is not familiar. Three major prob- 
lems are analyzed here. 


2.1 The Syntax of a Canadian Postal Address is Complex 


A postal address is composed of tokens (lexical items which can be considered as basic units 
in an address). A token can be either a delimiter, a term (or keyword), a word, a letter or a 
number. Figure 1 illustrates an example of token decomposition. Tokens can be combined to 
get address components which are larger address structures. In turn, a component can fall into 
three groups: designators, qualifiers and secondary words. Figure 1 also gives an example of
a component decomposition. Valid addresses are composed of both a set of valid combina- 
tions of components and a set of valid combinations of tokens. However, it is more practical 
for implementation purposes, to define an address with token patterns (combinations of 
tokens). Token patterns can be generated from a formal postal address grammar (written in 
BNF for example) and used directly for constructing a postal address. 
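
The notion of a token pattern can be illustrated with a short Python sketch that splits an address into tokens and classifies each one as a delimiter, term, number, letter or word. The small term dictionary is hypothetical (a real dictionary would hold hundreds of terms), so the classification of individual tokens may differ from the one shown in Figure 1; the function names are also assumptions.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary of terms (keywords); a real one is far larger.
TERMS = {"c/o", "apt", "st", "dr", "ont"}


def tokenize(address):
    """Split an address into (text, token_type) pairs."""
    tokens = []
    for text in re.findall(r"[,;]|[^\s,;]+", address):
        if text in (",", ";"):
            kind = "delim"
        elif text.lower() in TERMS:
            kind = "term"
        elif text.isdigit():
            kind = "number"
        elif len(text) == 1 and text.isalpha():
            kind = "letter"
        else:
            kind = "word"
        tokens.append((text, kind))
    return tokens


def token_pattern(address):
    """The sequence of token types, ignoring the actual text of the tokens."""
    return " ".join(kind for _, kind in tokenize(address))


def pattern_counts(addresses):
    """Count how many addresses share each token pattern."""
    return Counter(token_pattern(address) for address in addresses)


print(token_pattern("c/o J Doe, 1003 Prince of Wales dr, Ottawa, Ont"))
```

Counting patterns over a file in this way is what makes statements about the number of distinct patterns, and the share of addresses they cover, possible.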

This syntax is fairly complex. First of all, the grammar is sizeable. We have analyzed a 
national sample of 30,000 addresses taken from six different administrative files. In these 
addresses, we found around 4,900 different token patterns. This is substantially higher than 
what is reported in Drew (1987) because we have analyzed addresses from many different files,
not just one. Other interesting results concern the distribution of those patterns. Only 37
patterns are necessary to cover 50% of the addresses. So, there are a few common patterns, but
most of the patterns are rather rare. Nevertheless, this analysis illustrates the complexity of
postal address syntax by demonstrating that it is not restricted to just a few patterns. Secondly,
as many as 600 different terms can be found in a good national sample of addresses. Thirdly,
an address is usually in free format, i.e. the components (and the delimiters) can occur in any 
one of several positions. 


2.2 Addresses Don’t Follow Precise Standards 


Addresses representing the same address location can be written in many ways as illustrated 
in Figure 2. The reason for this situation is the flexibility in postal address syntax and also human 
nature. In fact, people write addresses as they like and follow the ‘‘standards’’ in use in their 
immediate environment. 


[Figure 1 shows the address "c/o J Doe, 1003 Prince of Wales dr, Ottawa, Ont" decomposed in two ways:
into tokens (each labelled as a term, letter, word, number or delimiter) and into components
(each labelled as a designator, qualifier or secondary word).]


Figure 1. Two ways of decomposing a postal address.


2.3 Ambiguities Occur in Postal Addresses


A postal address can’t be regarded only from a syntactic point of view. Its semantics (i.e.,
the meaning of a postal address) must be examined as well. Sometimes, one address can
potentially represent more than one location. We then face an ambiguity since we don’t know how
to interpret it. To do so, more knowledge is required in order to exclude the locations that don’t 
exist and to identify the correct location. However, this knowledge doesn’t always permit us 
to narrow down the location; we then face an unresolvable ambiguity. Figure 3 shows an 
example of an ambiguous address. 


3. COMPUTER SYSTEMS CONSIDERATIONS 


Now that we have a better understanding of postal addresses as well as their related prob- 
lems, we will concentrate on the use of postal addresses in computer systems. 


3.1 Computer Applications Requiring Address Information 


Several types of application require address information. Some record linkage projects
link individuals or dwellings (as in the construction of an Address Register) based on their
postal addresses. Their linkage rules operate essentially on standardized address components.
Databases and computer files storing postal addresses are also numerous. For
example, postal address information for an Address Register must be stored in some fashion,
either in a stand-alone flat file or in some kind of integrated database. But what information
is stored? Address components (standardized or not) could be. For follow-up or historical
purposes, the original input address could be kept as well. However, retrieval from a large
database (or a large flat file) requires an Address Search Key (ASK) to allow direct access (or
direct matching) to a record identified by a postal address. Mailing label processing is another
area where postal addresses are a major concern. Address components, standardized or not, can
form mailing labels.


1) 32 main st apt #1, Ottawa, Ontario
2) 32 main st apt #1, Ott., Ontario
3) 32 main st 1, Ottawa, Ontario
   Addresses 1) to 3) represent the same location.

4) 860 first st, Ottawa, Ontario
5) 860 1 st, Ottawa, Ontario
6) 860 1 st, Ott., Ont
   Addresses 4) to 6) represent the same location.


Figure 2. Examples of Addresses Which Represent the Same Location.


Apt 976 Fort St John BC   can be:

1) Apt 976, Fort St-John, BC
2) Apt 976 Fort, St-John, BC
3) Apt 976 Fort ST, John, BC

If at least two are an existing address, we have an unresolvable ambiguous address.


Figure 3. Example of an Ambiguity.


3.2 Three Basic Information Components


Therefore, three basic information components need to be derived from a free format postal 
address: the address components, the standardized components and the Address Search Key 
(ASK). 


1. THE ADDRESS COMPONENTS 


They represent recognizable and useful portions of an address. The major address com- 
ponents are street number, street name, street direction, street designator, postal designator, 
postal qualifier, municipality name, province name, and postal code. 


2. THE STANDARDIZED COMPONENTS 


They are the standardized version of the address components, where any style variations 
are removed. 


3. THE ADDRESS SEARCH KEY (ASK) 


This is a compressed string, unique for a given address. 


3.3 Postal Address Analysis System 


A complete Postal Address Analysis System (a computer system that generates the three
basic information components we need) represents an expert system in the field of postal
addresses: it is "expert" in the sense that it replaces a specialist (like a postman) in address
recognition. At Statistics Canada in the 1970s, two routines were developed to analyze postal
addresses. ENCODA (component decomposition) and ASKGEN2 (standardization and ASK) were
implemented for the Business Register Maintenance System. They served well until recently.
With the advent of powerful computers, new software development techniques and the Address
Register itself, they no longer perform to today’s standards.


- The encoding success rate is too low. A study using a national sample of addresses from
many administrative files shows that ENCODA cannot properly decode an address, on
average, 15% of the time. This is not acceptable since it could lead, in the case of the
creation of a national Address Register, to over one million encoding failures.

- The user interface is poor. There is no comprehensive status produced at the completion
of the analysis. As well, very few utilities are provided in order to ease programming
burden.

- The functionality is incomplete. Standardized components and ASK are mixed up in the same
data structure. Standardized components are truncated to allow data compression but ASK
is very long because it is stored in fields of fixed length. Also, the software doesn’t recognize
address ambiguity.

- Maintenance of the software is a nightmare. New address patterns are difficult to incorporate
into the routines because these are complex and tend to become more and more so with time.
This is a sign of aging software.


To fulfill the requirements of an Address Register and of Statistics Canada in the 
area of address analysis, the development of a completely new system was initiated. The 
problem this time has been approached with expert system techniques, modular design 
and full scale implementation. This new system is called PAAS, for Postal Address Analysis
System.


4. A POSTAL ADDRESS ANALYSIS SYSTEM: PAAS


PAAS is currently under development. Therefore, some results are preliminary, but in gen- 
eral very encouraging. We will review here the four basic functions of the system. 


4.1 Address Parsing 


The parsing function is the most important and complex function of PAAS. Here, PAAS 
accepts as input a free format address, scans it (breaks it into lexical items) and parses it (analyzes 
the syntax) to decode it into address components. 

This parser generates the following items for every address processed (Figure 4 illustrates 
two examples of this output): 

- A comprehensive Address Status code, such as V for valid, E for syntax error, etc.

- Identification of components in the input address.

- Components classification: every component is classified using a detailed code, so it is easy
  to understand the meaning of a component. This code is divided into three sub-codes:

    - TYPE code: indicates the group of components to which a component belongs. Examples
      of TYPEs are those for province (PR), municipality (MU), street (ST), etc.
    - CAT code: refines the group of components indicated by TYPE. Examples for the street
      TYPE (ST) are name (NA), number (NU), designator (DE), etc.
    - CLASS code: classifies a component by examining its characteristics. Examples are avenue
      (AV) or road (RD) classification of a street designator.

- Ambiguity detection: the PAAS parser flags any component that could change because of
  an ambiguity.


The PAAS parser was implemented using MPL. MPL is a meta-programming language.
It allows us to generate programs or subroutines used for syntax analysis and automatic
translation. The input to MPL is a set of specifications divided into the scanning (token recognition),
the syntax rules and the semantics. The scanning represents the lexical analysis where the input
is broken down into tokens. The syntax specification is similar to a BNF grammar specification:
the left-hand side symbol of a syntax rule is defined by the right-hand side symbols.
Figure 5 gives examples of syntax rules. Finally, a semantic action can be associated with any
rule and is used to handle some complex aspects of the syntax, as well as to perform other actions
(such as updating a table of components). The MPL language is well suited to writing translation
specifications and has been used at Statistics Canada to implement STATPAK (retrieval
and tabulation system for the census), NYSIIS (name encoding routine) and NAMEPARS
(name parser). It saves development time (e.g. you don’t need to write a detailed and custom
program in a traditional programming language such as PL/1). The specifications in BNF are
much easier to understand than is a program with a complex logic.


     ADDRESS                     COMPONENT  TYPE  CAT  CLASS         AMB_FLAG
(1)  32 Main st, Ottawa, Ont     32         ST    NU   **
                                 Main       ST    NA   **
     ADDRESS_STATUS ====> V      st         ST    DE   [unreadable]
                                 Ottawa     MU    NA   **
                                 Ont        PR    NA   35
(2)  32 Main st Ottawa Ont       32         ST    NU   **
                                 Main       ST    NA   **
     ADDRESS_STATUS ====> A      st         ST    DE   [unreadable]  *
                                 Ottawa     MU    NA   **            *
                                 Ont        PR    NA   35


Because the second example misses the commas to delimit the address, an ambiguity is flagged by PAAS.


Figure 4. Examples of the PAAS Parser Outputs.

The PAAS parser involves a rather complicated syntax analysis and represents a fairly important
MPL application. For example, a dictionary containing more than 600 terms assists in the
scanning of addresses. As well, more than a hundred syntax rules implement the syntax analysis.
In this syntax analysis, the initial tokens are transformed from a rule right-hand side to a rule
left-hand side and become higher level address fragments (this is known as forward chaining)
until the address is completely analyzed. During this process, the address components are
identified and stored in a table by the semantic action of a rule. The invalid addresses are found
whenever no rule is applicable. A sample set of rules to decode an address is illustrated in Figure
5. Finally, for some complex addresses, a special analysis is performed through the use of the
MPL semantic facility. This is required anytime an ambiguous term is encountered. In this
case, PAAS analyzes the surroundings of the ambiguous term.
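
The reduction process described above can be sketched in Python using the five sample rules of Figure 5. This is not MPL and not the PAAS parser: the rule table, the expansion of rule (1) into three alternatives, the rule ordering and the function names are all assumptions made to keep the example small, and semantic actions are omitted.

```python
# Each rule is (left-hand side, right-hand side). Rule (1) of Figure 5,
# <NAME> ::= <WORD | NAME> [WORD], is expanded into its alternatives here.
RULES = [
    ("NAME", ("WORD", "WORD")),
    ("NAME", ("NAME", "WORD")),
    ("NAME", ("WORD",)),
    ("ST_NAME", ("NAME",)),
    ("ST_NUMBER", ("NUMBER",)),
    ("ST_ADDRESS", ("ST_NUMBER", "ST_NAME", "ST_DESIGNATOR")),
    ("ADDRESS", ("ST_ADDRESS", "MUNICIPALITY", "PROVINCE", "PC")),
]


def reduce_once(symbols):
    """Apply the first rule whose right-hand side occurs in the string of fragments."""
    for lhs, rhs in RULES:
        width = len(rhs)
        for start in range(len(symbols) - width + 1):
            if tuple(symbols[start:start + width]) == rhs:
                return symbols[:start] + [lhs] + symbols[start + width:]
    return None  # no rule is applicable


def parse(symbols):
    """Return (status, trace): 'V' if the string reduces to ADDRESS, else 'E'."""
    current = list(symbols)
    trace = [current]
    while current != ["ADDRESS"]:
        current = reduce_once(current)
        if current is None:
            return "E", trace
        trace.append(current)
    return "V", trace


# Fragments for "100 Rideau st Ottawa Ont K1N5X2", as at the top of Figure 5.
status, trace = parse(["NUMBER", "WORD", "ST_DESIGNATOR", "MUNICIPALITY", "PROVINCE", "PC"])
print(status)
for step in trace:
    print(" ".join(step))
```

Each printed line corresponds to one string of address fragments, and an address is reported as invalid as soon as no rule applies before the string has been reduced to a single ADDRESS fragment.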

In comparison with ENCODA, the PAAS parser is an improvement in the following areas:

- The quality of the parsing: the PAAS parser is able to decode more addresses successfully
than ENCODA does. A series of parallel runs over identical national samples of addresses
showed that PAAS is successful on more than 97% of addresses, while ENCODA properly
handles only 85% of them.

- The indication of an address status: the status is more complete than ENCODA’s, which
provides for only two possibilities: decoded address or blank address.

- The components: PAAS generates much more comprehensive component information than
does ENCODA.

- The maintenance: the utilization of MPL helps in making the PAAS parser a lot easier to
maintain than a huge algorithm such as is used by ENCODA.


4.2 Components Standardization 


The standardization aims to remove any style variation in the address components defined 
in the parsing phase. 

Unlike ASKGEN2, PAAS doesn’t truncate any component and retains all the information 
in the components. This standardization is achieved basically in three different ways depen- 
ding on the nature of the component: 


1. CODABLE COMPONENTS 
Every component for which a limited number of values exist is standardized by replacing 
its value with the CLASS code of the component (this code uniquely identifies the stan- 
dardized value of the component). Falling into this category are components such as the 
province name, street designator, etc.


2. NAME COMPONENT NOT NUMBERED 


To standardize a non-numbered name component, several rules must be applied to 
transform the original value into a standardized value. The rules vary from the removal 
of useless characters (e.g. quote, hyphen, etc.) to abbreviation replacement (e.g. Mtl
becomes Montreal). 


3. NAME COMPONENT NUMBERED 


A numbered name component is standardized by returning its name as a number. For
example First becomes 1, Second 2, etc.


Address Search Key (ASK)


The Address Search Key should be unique and short. 

Uniqueness is accomplished by concatenating in a pre-determined order the standardized 
components of an address (rather than a table as with ASKGEN2). We must note here that 
the ASK doesn’t necessarily represent a unique identifier for dwellings. In rural areas for 
example, a postal address quite often represents many dwellings (e.g. RR #1 Ottawa Ontario). 
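
The three standardization cases of Section 4.2 and the concatenation into an ASK can be sketched together in Python. All lookup tables, the component names, the concatenation order and the "|" separator are hypothetical choices made for the example; the actual PAAS tables and key layout are not described in this paper.

```python
import re

# Hypothetical lookup tables; a real system would use far larger ones.
CLASS_CODES = {
    "street_designator": {"ST": "ST", "STREET": "ST", "AV": "AV", "AVE": "AV", "AVENUE": "AV"},
    "province": {"ONT": "ON", "ONTARIO": "ON", "ON": "ON"},
}
ABBREVIATIONS = {"OTT": "OTTAWA", "MTL": "MONTREAL"}
ORDINALS = {"FIRST": "1", "SECOND": "2", "THIRD": "3"}
# Assumed pre-determined order of components within the key.
ASK_ORDER = ("street_number", "street_name", "street_designator", "municipality", "province")


def standardize(component, value):
    """Remove style variations from one address component."""
    text = re.sub(r"[^A-Z0-9 ]", "", value.upper().replace("-", " "))
    words = text.split()
    if component in CLASS_CODES:                      # 1. codable component
        key = "".join(words)
        return CLASS_CODES[component].get(key, key)
    words = [ABBREVIATIONS.get(w, w) for w in words]  # 2. name component, not numbered
    words = [ORDINALS.get(w, w) for w in words]       # 3. name component, numbered
    return " ".join(words)


def address_search_key(components):
    """Concatenate the standardized components in a fixed order."""
    return "|".join(standardize(name, components.get(name, "")) for name in ASK_ORDER)


# Addresses 4) and 6) of Figure 2, already split into components.
first = {"street_number": "860", "street_name": "first", "street_designator": "st",
         "municipality": "Ottawa", "province": "Ontario"}
second = {"street_number": "860", "street_name": "1", "street_designator": "st",
          "municipality": "Ott.", "province": "Ont"}
print(address_search_key(first) == address_search_key(second))
```

Because the key is built only from standardized values, two spellings of the same address produce the same key, which is what allows direct matching.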


Address to parse: 100 Rideau st Ottawa Ont K1N5X2

At some point, we have a string of address fragments which will be transformed
by five rules. The "|" denotes an "OR" and [ ] is an optional syntax element.


<NUMBER> <WORD> <ST_DESIGNATOR> <MUNICIPALITY> <PROVINCE> <PC>
    String of address fragments that will be transformed by rule (1).

<NAME> ::= <WORD | NAME> [WORD]                                        RULE (1)
<NUMBER> <NAME> <ST_DESIGNATOR> <MUNICIPALITY> <PROVINCE> <PC>
    New string of address fragments from rule (1). This string will
    be transformed by rule (2). Note that a semantic action associated
    with rule (2) would be appropriate to identify the street
    name component.

<ST_NAME> ::= <NAME>                                                   RULE (2)
<NUMBER> <ST_NAME> <ST_DESIGNATOR> <MUNICIPALITY> <PROVINCE> <PC>

<ST_NUMBER> ::= <NUMBER>                                               RULE (3)
<ST_NUMBER> <ST_NAME> <ST_DESIGNATOR> <MUNICIPALITY> <PROVINCE> <PC>

<ST_ADDRESS> ::= <ST_NUMBER> <ST_NAME> <ST_DESIGNATOR>                 RULE (4)
<ST_ADDRESS> <MUNICIPALITY> <PROVINCE> <PC>

<ADDRESS> ::= <ST_ADDRESS> <MUNICIPALITY> <PROVINCE> <PC>              RULE (5)


The process is complete since the string has been analyzed entirely.


Figure 5. Rules for a Sample Address Syntax.


To shorten the key, different compression techniques can be used. However, compression
takes time and we have to choose a technique that will be efficient. We are experimenting with 
two different techniques. 


1. TRUNCATION
Here, the name components are truncated. This technique is not real compression and could
affect the uniqueness of a key. However, it is simple and fast.


2. REAL COMPRESSION
A compression technique that we are looking at consists basically of replacing common
combinations of characters by a character code not in use for writing an address. Here, we will
preserve the uniqueness but increase the complexity of generating and using a key. Therefore,
a longer ASK calculation time is expected with this technique.
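
The trade-off between the two techniques can be seen in a small Python sketch. The truncation width, the substitution table and the function names are hypothetical; the sketch only shows that truncation can make two different keys collide, whereas substituting unused character codes for common character pairs is reversible.

```python
def truncate_key(components, width=4):
    """Truncate every component to `width` characters (fast, but may collide)."""
    return "".join(part[:width] for part in components)


# Hypothetical substitution table: frequent character pairs mapped to
# control characters, which never occur in a written address.
PAIR_CODES = {"ST": "\x01", "TA": "\x02", "ON": "\x03"}


def compress_key(key, pair_codes=PAIR_CODES):
    """Replace each common pair of characters by its one-character code."""
    for pair, code in pair_codes.items():
        key = key.replace(pair, code)
    return key


def expand_key(key, pair_codes=PAIR_CODES):
    """Restore the original key from its compressed form."""
    for pair, code in pair_codes.items():
        key = key.replace(code, pair)
    return key


# Truncation: two different street names end up with the same key.
print(truncate_key(["860", "MAINLAND", "ST"]) == truncate_key(["860", "MAINWARING", "ST"]))

# Substitution: the key gets shorter and can be recovered exactly.
key = "860|1|ST|OTTAWA|ON"
packed = compress_key(key)
print(len(key), len(packed), expand_key(packed) == key)
```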


4.3. Ambiguity Resolution 


Once an ambiguity is detected during parsing, it must be resolved, either manually
or automatically by the PAAS system. In an attempt to resolve an ambiguity, PAAS uses a
municipality name file; this file covers the whole country with around 6000 names and has
as its source the Postal Code Directory tape from Canada Post.

This methodology is limited to the problems related to municipality names. This limitation
is acceptable since these problems account for a good portion of the ambiguous situations, and
are easy to detect and to resolve (they don’t involve a large amount of data). Future work could
examine the usefulness of detecting and resolving more situations.
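
A municipality-based resolution step might look like the following Python sketch, which uses the address of Figure 3. Every way of splitting the words that precede the province into a street part and a municipality part is tried, and only the splits whose municipality appears in the name file are kept. The set of names, the status labels and the function name are hypothetical, and the way PAAS actually consults its municipality file may differ.

```python
def resolve_municipality(words, municipalities):
    """Try every split of `words` into (street part, municipality name).

    Returns (status, candidates) where the candidates are the splits whose
    municipality name is found in the municipality name file.
    """
    candidates = []
    for start in range(len(words)):
        name = " ".join(words[start:]).upper().replace("-", " ")
        if name in municipalities:
            candidates.append((words[:start], name))
    if len(candidates) == 1:
        return "resolved", candidates
    if not candidates:
        return "non-existent", candidates
    return "unresolvable", candidates


# Hypothetical extract of a municipality name file.
names = {"FORT ST JOHN", "OTTAWA"}
print(resolve_municipality(["Apt", "976", "Fort", "St", "John"], names))
```

When exactly one split survives, the ambiguity is resolved automatically; when several survive, or none does, the address would be left for manual follow-up.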

Finally, no matter how good the software becomes, the unresolvable and the non-existent
addresses will remain a problem and should be followed up manually.


5. CONCLUSION 


The results of postal address analysis as accomplished by PAAS are encouraging. It decodes 
a vast majority of addresses, outputs a very informative code for every component, standardizes 
and generates an ASK properly, and handles ambiguities. Also, PAAS integrates utilities 
and interfaces for users and maintainers. 

Users have access to an interface which processes their addresses through the four basic func- 
tions as well as a facility that handles the addresses in error (on-line processing). A file pro- 
cessor program is also provided. 

Also integrated into PAAS is a quality assurance tool for PAAS maintainers. PAAS will 
evolve in the future with the discoveries of new addresses and obsolete addresses. Making sure 
that the changes to the system are applied properly is tricky. This maintenance tool ensures 
that a change to the software doesn’t jeopardize any valid addresses properly analyzed in 
previous versions of the system. 
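
The kind of check such a tool performs can be sketched as a comparison against stored results. The sketch below is only an assumption about the general approach; the analyzer functions, the sample addresses and the name `find_regressions` are hypothetical and do not come from PAAS.

```python
def find_regressions(analyze, baseline):
    """Return addresses whose analysis differs from the stored baseline result."""
    regressions = {}
    for address, expected in baseline.items():
        actual = analyze(address)
        if actual != expected:
            regressions[address] = (expected, actual)
    return regressions

def analyze_v1(address):
    # Toy stand-in for the previous version of the analyzer.
    return tuple(address.upper().split())

def analyze_v2(address):
    # Toy stand-in for a modified version that rewrites "ST" as "STREET".
    return tuple("STREET" if t == "ST" else t for t in address.upper().split())

# Results produced by the previous version are kept as the reference.
baseline = {a: analyze_v1(a) for a in ["12 Main St Ottawa", "5 Elm Avenue Hull"]}

# Any address listed here was analyzed differently by the new version.
regressions = find_regressions(analyze_v2, baseline)
for address, (expected, actual) in regressions.items():
    print(address, expected, actual)
```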
A Brief Note on SQL 


DAVID N. EMERY¹ 


ABSTRACT 


This note portrays SQL, highlighting its strengths and weaknesses. 


KEY WORDS: Relational database management system; Database query language. 


1. INTRODUCTION 


A great deal of media attention has been focused on relational database management systems 
and SQL (pronounced see-quel), the most popular of the associated database query languages. 
To a large extent, SQL has been cast in the role of panacea for all the ills associated with data 
management. Unfortunately, this leads to a great deal of misconception on the part of poten- 
tial users of SQL. These people are then sometimes disappointed with SQL when they even- 
tually get a chance to use it. 

The intent of this note is to clear up some of this misconception by providing a realistic por- 
trayal of SQL, highlighting its inherent strengths and weaknesses. No attempt will be made 
to elaborate the advantages of the relational data model itself. These advantages have been 
adequately documented elsewhere (Date 1985). 


2. SQL - WHAT IS IT? 


The interaction which takes place between a user (whether systems developer or end user) 
and a database management system can be broadly categorized according to the function taking 
place: 


• data definition; 
• data control (i.e. authorization and control of data integrity); 
• data retrieval; and, 
• data modification (i.e. insert, update, and delete). 


A database management system must provide interfaces for carrying out each of these func- 
tions. Depending on the particular system, these interfaces take the form of utilities, query 
languages, and/or subroutine libraries for programming languages. 

SQL addresses these four functions in a single well-defined, rigidly structured language. 
SQL is the interface used to communicate, to the database management system, how relations 
(i.e. logical files or tables) are to be subdivided and/or combined to create new relations. 

The key to understanding SQL’s capabilities is an appreciation of the fact that SQL addresses 
exactly these four roles - no more and no less. Any other functionality must be supplied by 
the application which initiates the SQL statement. 

Consider the following example. The table, DWELLING, contains information about 
dwellings such as number of occupants, type of dwelling, where it is located, type of heating, 


¹ David N. Emery, Statistics Canada, Research and General Systems Subdivision, Room 2405, Main Building, Tunney’s 
and age of dwelling. In order to impute the type of dwelling, one might want to obtain a set 
of potential donor dwellings which are in the same geographic area, are the same age, and use 
the same heating method. The following SQL statement could be issued to obtain a donor set: 


SELECT DWELLING_ID, TYPE_OF_DWELLING                         (Query 1)
FROM DWELLING
WHERE HEATING_TYPE = 'GAS' AND
      AGE = 20 AND
      LOCATION_CODE = 'XYZ';


SQL does not provide a mechanism for manipulating the set of retrieved donor records. 
Selecting the n’th record, every second record, or a random record are all beyond the capability 
of SQL. Similarly, SQL has no mechanism for manipulating a table to affect its appearance 
on a terminal or printer. These are capabilities one would rightfully demand of a program- 
ming language, and hence the term database query language. Calling SQL a fourth genera- 
tion language (4GL), then comparing it to products which incorporate only the data retrieval 
and data modification functions into a programming language, only adds to the confusion. 
It is really an apples and oranges comparison since both are 4GLs, but of very different flavours. 
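
This division of labour can be sketched with a small host program. The example uses Python with its bundled SQLite library purely as a convenient stand-in for a relational DBMS; the table contents are invented, and only the column names come from Query 1. SQL retrieves the donor set, and the choice of a random donor is left to the application.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE DWELLING (
           DWELLING_ID INTEGER PRIMARY KEY,
           TYPE_OF_DWELLING TEXT,
           HEATING_TYPE TEXT,
           AGE INTEGER,
           LOCATION_CODE TEXT)"""
)
# Invented sample rows, for illustration only.
conn.executemany(
    "INSERT INTO DWELLING VALUES (?, ?, ?, ?, ?)",
    [
        (1, "DETACHED", "GAS", 20, "XYZ"),
        (2, "ROW", "GAS", 20, "XYZ"),
        (3, "APARTMENT", "OIL", 20, "XYZ"),
        (4, "DETACHED", "GAS", 35, "ABC"),
    ],
)

# Query 1: the database returns the whole donor set.
donors = conn.execute(
    """SELECT DWELLING_ID, TYPE_OF_DWELLING
       FROM DWELLING
       WHERE HEATING_TYPE = 'GAS' AND
             AGE = 20 AND
             LOCATION_CODE = 'XYZ'"""
).fetchall()

# Picking one donor at random is done by the host program, not by SQL.
rng = random.Random(1)
donor = rng.choice(donors)
print(donor)
```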

Given this very focused functionality, the obvious question then has to be — why all the 
fuss about SQL? 


3. SQL — ITS BENEFITS 


3.1. Implementation Transparency 


A SQL query indicates nothing about how the data is actually organized and stored on the 
database. The query states what is to be retrieved, modified, or stored; the database manage- 
ment system determines the best way to do it. Issues such as: 


• which data columns are indexed (a performance improvement feature); 
• whether the table/column is actually stored or merely an execution time combination of 
other tables; and, 
• the data’s internal representation (i.e. floating point, packed decimal, binary) 

have no bearing whatsoever on a SQL statement’s syntax. Consequently, the user is immune 
to changes in the database’s organization and structure. Changes to the underlying structure 
of the database can be made at will without changing the query. A query is immediately able 
to take advantage of improvements in the database structure or optimization algorithms. 

Similarly, when formulating a SQL query the user does not specify the order in which processing 
is to take place to satisfy the query. That is the responsibility of the query processing 
software’s optimization algorithms. This software evaluates the query against the current structure 
and organization of the database to determine the most efficient way of satisfying it. 
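
A small experiment illustrates this transparency. The sketch assumes SQLite (through Python) as the DBMS and uses invented rows and a hypothetical index name; `EXPLAIN QUERY PLAN` is SQLite's own facility for showing the strategy the optimizer has chosen. The query text is never changed, yet the DBMS is free to answer it differently once an index exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DWELLING (DWELLING_ID INTEGER PRIMARY KEY, LOCATION_CODE TEXT)"
)
# Invented sample rows, for illustration only.
conn.executemany(
    "INSERT INTO DWELLING VALUES (?, ?)",
    [(1, "XYZ"), (2, "ABC"), (3, "XYZ"), (4, "DEF")],
)

QUERY = (
    "SELECT DWELLING_ID FROM DWELLING "
    "WHERE LOCATION_CODE = 'XYZ' ORDER BY DWELLING_ID"
)

def run_and_explain(connection):
    """Run the unchanged query and return its rows and the optimizer's plan."""
    rows = connection.execute(QUERY).fetchall()
    plan = [row[-1] for row in connection.execute("EXPLAIN QUERY PLAN " + QUERY)]
    return rows, plan

rows_before, plan_before = run_and_explain(conn)

# A change to the physical organization: add an index (hypothetical name).
conn.execute("CREATE INDEX IDX_LOCATION ON DWELLING (LOCATION_CODE)")

rows_after, plan_after = run_and_explain(conn)

print(rows_before == rows_after)  # the answer is unaffected by the index
print(plan_before)                # the strategy chosen without the index
print(plan_after)                 # the strategy chosen with the index
```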


3.2. Non-proprietary, Internationally Accepted Standard 


Both the International Organization for Standardization (ISO) and the American National Standards 
Institute (ANSI) have recently adopted a common standard for SQL (ISO 1987). The existence 
of this standard, with a commitment to it by a number of relational database management 
system vendors, gives software developers access to a much broader market without significant 
extra development effort. By building their applications on top of standard SQL, they 
remove their reliance on a particular database management system. As a result, the creation 
of software tools, built upon an interface to this standard version of SQL, has become a major 
growth industry. For example, natural language interfaces, fourth generation programming 
languages, data dictionary software, data entry/validation packages, and spreadsheet soft- 
ware, all layered on top of ANSI/ISO SQL, are beginning to appear on the market. 

The active interest in SQL has also had a very positive impact on the SQL standard itself; 
it is continuing to evolve. The most recent draft revision to the ISO Standard for SQL incor- 
porates the specification of referential integrity constraints into SQL’s data definition 
statements. The significance of this extension to SQL is best illustrated by a further elabora- 
tion of the DWELLING example. Assume that the database also has a table PERSONS which 
contains detailed information about individuals including a dwelling code which indicates the 
dwelling where they currently reside. One might define an integrity constraint stipulating that 
each person must be associated with exactly one dwelling. Consequently, it would be an error 
to delete a DWELLING record which still had any PERSONS records referencing it, or to add 
a PERSONS record which referenced a nonexistent DWELLING record. Currently, logic to 
detect and prevent these inconsistencies must be inserted into each application program capable 
of deleting a DWELLING record. With the incorporation of referential integrity specifica- 
tions into SQL, this program logic will no longer be required. The DBMS software assumes 
responsibility for detecting and terminating any attempt to remove a DWELLING record which 
still has associated PERSONS records. 
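
The effect of such a constraint can be sketched with a DBMS that already supports declarative referential integrity. The example below uses SQLite through Python as an assumed stand-in; the column names `PERSON_ID` and `DWELLING_ID` in PERSONS and the helper `rejected` are hypothetical, and the syntax shown is SQLite's, not necessarily that of the draft standard.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE DWELLING (DWELLING_ID INTEGER PRIMARY KEY)")
# Each person must reference exactly one existing dwelling.
conn.execute(
    """CREATE TABLE PERSONS (
           PERSON_ID INTEGER PRIMARY KEY,
           DWELLING_ID INTEGER NOT NULL REFERENCES DWELLING (DWELLING_ID))"""
)
conn.execute("INSERT INTO DWELLING VALUES (1)")
conn.execute("INSERT INTO PERSONS VALUES (10, 1)")

def rejected(statement):
    """Return True if the DBMS refuses the statement on integrity grounds."""
    try:
        conn.execute(statement)
    except sqlite3.IntegrityError:
        return True
    return False

# Deleting a dwelling that still has residents is refused by the DBMS itself.
delete_rejected = rejected("DELETE FROM DWELLING WHERE DWELLING_ID = 1")
# Adding a person who references a nonexistent dwelling is refused as well.
insert_rejected = rejected("INSERT INTO PERSONS VALUES (11, 99)")

print(delete_rejected, insert_rejected)
```

No checking logic appears in the application code; the constraint is declared once, in the data definition statement.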


3.3. Ease of Extension 


One of the major differences between the various vendors’ versions of SQL is the number 
and variety of supported functions. This is to a large extent due to the ease with which extra 
functionality can be incorporated into SQL, without change to its overall structure. For 
example, the SQL standard documents the grouping functions of average (AVG), maximum 
(MAX), minimum (MIN), enumeration (COUNT) and aggregation (SUM) for unweighted data. 
Referring again to the earlier DWELLING example, one could generate various summary 
statistics about number of occupants, broken down by geographic location: 


SELECT AVG (NO_OF_OCCUPANTS), MAX (NO_OF_OCCUPANTS),         (Query 2)
       MIN (NO_OF_OCCUPANTS), SUM (NO_OF_OCCUPANTS),
       COUNT (NO_OF_OCCUPANTS)
FROM DWELLING
GROUP BY LOCATION_CODE;


Some vendors have augmented these functions with others such as variance (VARIANCE) 
and standard deviation (STDDEV). With these extra functions the identification of outliers, 
more than one standard deviation from the mean, is a very straightforward exercise: 


SELECT DWELLING_ID FROM DWELLING                             (Query 3)
WHERE NO_OF_OCCUPANTS <
      (SELECT AVG (NO_OF_OCCUPANTS) - STDDEV (NO_OF_OCCUPANTS)
       FROM DWELLING)
   OR
      NO_OF_OCCUPANTS >
      (SELECT AVG (NO_OF_OCCUPANTS) + STDDEV (NO_OF_OCCUPANTS)
       FROM DWELLING);
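
Queries 2 and 3 can be tried out with the following sketch, which again assumes SQLite through Python. SQLite has no built-in STDDEV, so one is registered as a user-defined aggregate, which itself illustrates how a function can be added without altering the language's structure. The sample rows are invented, the choice of the sample (n - 1) formula is an assumption, and LOCATION_CODE is added to the select list of Query 2 only to label the output.

```python
import math
import sqlite3

class StdDev:
    """User-defined aggregate: sample standard deviation (assumed n - 1 form)."""

    def __init__(self):
        self.values = []

    def step(self, value):
        if value is not None:
            self.values.append(value)

    def finalize(self):
        n = len(self.values)
        if n < 2:
            return None
        mean = sum(self.values) / n
        return math.sqrt(sum((v - mean) ** 2 for v in self.values) / (n - 1))

conn = sqlite3.connect(":memory:")
conn.create_aggregate("STDDEV", 1, StdDev)
conn.execute(
    """CREATE TABLE DWELLING (
           DWELLING_ID INTEGER PRIMARY KEY,
           LOCATION_CODE TEXT,
           NO_OF_OCCUPANTS INTEGER)"""
)
# Invented sample rows, for illustration only.
conn.executemany(
    "INSERT INTO DWELLING VALUES (?, ?, ?)",
    [(1, "XYZ", 2), (2, "XYZ", 3), (3, "XYZ", 3), (4, "ABC", 4), (5, "ABC", 9)],
)

# Query 2, with the grouping column added to label each summary row.
summary = conn.execute(
    """SELECT LOCATION_CODE, AVG(NO_OF_OCCUPANTS), MAX(NO_OF_OCCUPANTS),
              MIN(NO_OF_OCCUPANTS), SUM(NO_OF_OCCUPANTS), COUNT(NO_OF_OCCUPANTS)
       FROM DWELLING
       GROUP BY LOCATION_CODE
       ORDER BY LOCATION_CODE"""
).fetchall()

# Query 3: dwellings more than one standard deviation from the mean.
outliers = conn.execute(
    """SELECT DWELLING_ID FROM DWELLING
       WHERE NO_OF_OCCUPANTS <
             (SELECT AVG(NO_OF_OCCUPANTS) - STDDEV(NO_OF_OCCUPANTS) FROM DWELLING)
          OR NO_OF_OCCUPANTS >
             (SELECT AVG(NO_OF_OCCUPANTS) + STDDEV(NO_OF_OCCUPANTS) FROM DWELLING)
       ORDER BY DWELLING_ID"""
).fetchall()

print(summary)
print(outliers)
```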


3.4. Single Interface to the Database 


When interrogating a database from within a host language program such as PL/1, 
FORTRAN, or C, one also uses SQL statements. These statements are virtually identical to 
those used when interrogating the database interactively via a SQL statement processor. The 
only difference lies in the fact that the host language interface requires an additional INTO 
clause to indicate the program variables receiving the results of the query. 

By using an identical interface to a host programming language, one is able to separate the 
program development and debugging exercise into two distinct activities: 


• testing of the database retrieval/storage statements (i.e. the SQL statements themselves), and 
• testing of the program code which manipulates the data. 


The first of these activities can be carried out using a SQL command interpreter even before 
the host language program has been written. The optimal SQL statements can then be moved 
directly into the host program where the testing effort can be focused on the logic associated 
with manipulating the data. 

Since the SQL statements embedded in the host language are interpreted at execution time, 
any changes made to the database organization or structure are immediately reflected in the 
program. 


3.5. Suitability for Distributed Databases/Database Machines 


One of the hottest topics in database management systems technology today is distributed 
databases. In a distributed database environment, the data is spread across a number of dif- 
ferent databases (often on physically separate machines). It is the DBMS software’s respon- 
sibility to intercept a user’s query, translate it into appropriate queries to the various constituent 
databases, and assemble the results of these queries for presentation. 

As discussed earlier, a SQL statement is devoid of constructs associated with describing how 
or where the data is stored on the database. Consequently, in a distributed database environ- 
ment where SQL is used as the database query language, data can be moved between machines 
with no change whatsoever to existing applications. SQL is therefore becoming quite popular 
with the developers of distributed database management systems. 

For similar reasons, SQL is gaining popularity as a query language for database machines. 
These machines take advantage of relational (i.e. tabular) data structures’ inherent regularity 
to partition them across a number of parallel processors. These processors have instruction 
sets specifically designed to perform relational operations. The lack of representational detail 
in SQL queries completely insulates users from an awareness of what these machines are doing 
behind the scenes. 


4. SUMMARY 


There is no question that SQL has quickly become the pre-eminent database query language. 
The database management system which does not feature a SQL interface will soon be the excep- 
tion. An interesting anomaly will however emerge. The user will, over time, see less and less 
of SQL. Rather than trying to make SQL itself a user-friendly language, effort will be focused 
on the development of application-specific tools which provide the user with an interface tailored 
to the task at hand. SQL will be the common interface between these tools and the various 
databases. 
ABSTRACT 


A comprehensive bibliography of books, research reports and published papers, dealing with the theory, 
application and development of randomized response techniques, includes a subject classification. 


KEY WORDS: Survey; Sensitive issues; Confidentiality. 


1. INTRODUCTION 


The recent increase in requirements for extensive data on sensitive issues (such as the 
detailed information on sexual behavior necessary to study the spread of the AIDS epidemic) 
has led to renewed examination of the techniques available for obtaining answers to sensitive 
questions. The difficulties of applying conventional survey techniques to obtain data 
on sensitive issues in a large-scale survey are well known and several alternative techniques 
have been proposed (Bradburn and Sudman 1979). The most prominent of these has been 
the randomized response technique, originally proposed by Warner (1965). The underlying 
idea is that the respondent uses a random mechanism to select the question which he answers 
and the interviewer knows only the response itself, without knowing which question is being 
answered. This is supposed to reduce biases due to non-response and to response error, by 
assuring the respondent that his privacy is protected by the method (in that the question he 
is being asked is unknown to the interviewer) and thereby convincing him to cooperate more 
readily and to answer more truthfully than he might by a direct question. 
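
As a concrete illustration of the idea, the sketch below simulates Warner's original design, in which the random device directs the respondent, with known probability p, to the statement "I belong to group A" and otherwise to its complement. The probability of a "yes" answer is then λ = pπ + (1 - p)(1 - π), so the sensitive proportion π can be estimated as (λ̂ - (1 - p)) / (2p - 1) when p differs from 1/2. The function names, the sample size and the parameter values are assumptions chosen for illustration, and all respondents are assumed to answer truthfully.

```python
import random

def warner_estimate(yes_fraction, p):
    """Estimate the sensitive proportion from the observed fraction of 'yes'."""
    if p == 0.5:
        raise ValueError("p = 0.5 carries no information about the proportion")
    return (yes_fraction - (1 - p)) / (2 * p - 1)

def simulate(true_pi, p, n, seed=0):
    """Simulate n truthful respondents and return the resulting estimate."""
    rng = random.Random(seed)
    yes = 0
    for _ in range(n):
        member = rng.random() < true_pi   # does the respondent belong to group A?
        asked_a = rng.random() < p        # device selects "I belong to group A"
        # The interviewer records only yes/no, never which statement was drawn.
        answer = member if asked_a else not member
        yes += answer
    return warner_estimate(yes / n, p)

estimate = simulate(true_pi=0.2, p=0.7, n=200_000)
print(round(estimate, 3))
```

The price of the added privacy is a larger variance: the division by 2p - 1 inflates the sampling error, increasingly so as p approaches 1/2.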

Since 1965 a great deal of research into various aspects of the technique has been carried 
out. This includes theoretical developments, development of new randomization techniques 
and extensions to quantitative variables, to polytomous questions and to the multivariate case. 
Problems of estimation, optimization of design parameters and sample design, specific to 
randomized response, have also been dealt with. A large number of empirical studies using 
randomized response have been carried out in various application areas, such as studies of 
drug use, abortions, drunken driving and crime, many of them with some evaluation, often 
by validation studies. The experience in these studies is very divergent, with some showing 
marked gains due to the use of randomized response and others showing no gain at all in 
response rates or in response reliability. Respondents’ attitudes to randomized response, their 
comprehension of the procedure, their perceptions of confidentiality and of the protection 
that the procedure provides have also been investigated, in attempts to understand the reasons 
for the differences in the empirical results. 

This large body of research is scattered among over 250 theses, research reports, published 
papers and books, which have appeared (in at least seven languages) over the last 20-odd 
years. These include many expository and survey papers and two bibliographies, Kim and 
Flueck (1976) and Daniel (1979), the latter an annotated one. Three comprehensive books 
on the subject, Defaa (1982), Fox and Tracy (1986) and Chaudhuri and Mukerjee (1988), 
have also appeared. Unfortunately none of these includes a fully comprehensive and updated 
bibliography and the present one is an attempt to fill this lacuna. 

Although an attempt has been made to be as comprehensive as possible, by including both 
published and unpublished papers, the latter are obviously covered only insofar as information 
about them was available from various sources. In addition, an attempt was made to reduce 
duplication by excluding unpublished reports or papers presented at meetings whose content 
is substantially included in a subsequently published paper. However, Ph.D. theses are gener- 
ally included, since they usually have more detail than the papers derived from them. Papers 
about other survey methods for dealing with sensitive issues, which can be considered as alter- 
natives to randomized response, are included only if they relate to a comparison of the alter- 
native to randomized response. Papers dealing with randomization techniques to ensure 
confidentiality of data already collected (such as random rounding or encoding) are not 
included, unless they also relate to the use of randomization in the collection process itself. 

The bibliography is arranged as an alphabetical listing, which gives full citation details in 
the standard way used for reference lists. Titles are given in the language of the paper or book, 
if known. Otherwise, for publications not in English, the title is given in English with a designa- 
tion of the original language in parentheses. Most of the non-English papers include a sum- 
mary or abstract in English. A series of letter codes on the right edge of the page, opposite 
each reference, indicates a classification by subject. The classification categories and codes 
are given below. An author index and a classified listing by subject, not included due to space 
limitations, are available from the author. 


2. SUBJECT CLASSIFICATION CODES 


A - Applications and field experiments. 
B - Bibliographies and survey papers. 
C - Confidentiality, respondent comprehension, attitude and protection. 
E - Evaluation of alternative techniques or estimators. 
H - Hypothesis testing, estimation and analysis. 
M - Multivariate case. 
O - Optimization of design parameters. 
P - Polytomous questions. 
Q - Quantitative variables. 
R - Randomization devices and techniques. 
S - Sample design. 
T - Theoretical developments. 
V - Validation studies. 
X - Expository papers. 
Tree diagram of residuals based on the conditional empirical Bayes (CEB) estimator 